Real-world Data WranglingΒΆ
In this project, an end-to-end data preparation workflow is implemented using two real-world datasets. The data is collected and extracted, evaluated for quality and structural issues through both automated checks and visual inspection, and cleaned using a transparent, well-justified strategy. The refined data is stored in a selected database or data store, integrated into a unified analysis dataset, and used to answer a targeted research question. Key methodological choices are documented, and the code is annotated to support readability and reproducibility.
!python -m pip install kaggle==1.6.12
Defaulting to user installation because normal site-packages is not writeable
Collecting kaggle==1.6.12
Downloading kaggle-1.6.12.tar.gz (79 kB)
ββββββββββββββββββββββββββββββββββββββββ 79.7/79.7 kB 3.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: six>=1.10 in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (1.16.0)
Collecting certifi>=2023.7.22
Downloading certifi-2025.11.12-py3-none-any.whl (159 kB)
βββββββββββββββββββββββββββββββββββββββ 159.4/159.4 kB 9.7 MB/s eta 0:00:00
Requirement already satisfied: python-dateutil in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (2.9.0.post0)
Requirement already satisfied: requests in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (2.29.0)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (4.65.0)
Collecting python-slugify
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Requirement already satisfied: urllib3 in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (1.26.15)
Requirement already satisfied: bleach in /opt/conda/lib/python3.10/site-packages (from kaggle==1.6.12) (6.1.0)
Requirement already satisfied: webencodings in /opt/conda/lib/python3.10/site-packages (from bleach->kaggle==1.6.12) (0.5.1)
Collecting text-unidecode>=1.3
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
ββββββββββββββββββββββββββββββββββββββββ 78.2/78.2 kB 16.3 MB/s eta 0:00:00
Requirement already satisfied: charset-normalizer<4,>=2 in /opt/conda/lib/python3.10/site-packages (from requests->kaggle==1.6.12) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in /opt/conda/lib/python3.10/site-packages (from requests->kaggle==1.6.12) (3.4)
Building wheels for collected packages: kaggle
Building wheel for kaggle (setup.py) ... done
Created wheel for kaggle: filename=kaggle-1.6.12-py3-none-any.whl size=102969 sha256=35252a1665ce54934064e043d40d970d45edd3ce256e9a015f30548687f39fad
Stored in directory: /home/student/.cache/pip/wheels/1e/0b/7c/50f8e89c3d2f82838dbd7afeddffbb9357003009ada98216c7
Successfully built kaggle
Installing collected packages: text-unidecode, python-slugify, certifi, kaggle
WARNING: The script slugify is installed in '/home/student/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script kaggle is installed in '/home/student/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed certifi-2025.11.12 kaggle-1.6.12 python-slugify-8.0.4 text-unidecode-1.3
!pip install --target=/workspace ucimlrepo numpy==1.24.3
Collecting ucimlrepo Using cached ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB) Collecting numpy==1.24.3 Using cached numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB) Collecting pandas>=1.0.0 Using cached pandas-2.3.3-cp310-cp310-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (12.8 MB) Collecting certifi>=2020.12.5 Using cached certifi-2025.11.12-py3-none-any.whl (159 kB) Collecting python-dateutil>=2.8.2 Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB) Collecting pytz>=2020.1 Using cached pytz-2025.2-py2.py3-none-any.whl (509 kB) Collecting tzdata>=2022.7 Using cached tzdata-2025.2-py2.py3-none-any.whl (347 kB) Collecting six>=1.5 Using cached six-1.17.0-py2.py3-none-any.whl (11 kB) Installing collected packages: pytz, tzdata, six, numpy, certifi, python-dateutil, pandas, ucimlrepo Successfully installed certifi-2025.11.12 numpy-1.24.3 pandas-2.3.3 python-dateutil-2.9.0.post0 pytz-2025.2 six-1.17.0 tzdata-2025.2 ucimlrepo-0.0.7 WARNING: Target directory /workspace/dateutil already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/ucimlrepo already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/__pycache__ already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/pandas already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/python_dateutil-2.9.0.post0.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/tzdata already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/numpy-1.24.3.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/numpy already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/six-1.17.0.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/six.py already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/tzdata-2025.2.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/pandas-2.3.3.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/pytz already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/certifi-2025.11.12.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/ucimlrepo-0.0.7.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/numpy.libs already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/certifi already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/pytz-2025.2.dist-info already exists. Specify --upgrade to force replacement. WARNING: Target directory /workspace/bin already exists. Specify --upgrade to force replacement.
Note: Restart the kernel to use updated package(s).
1. Gather dataΒΆ
In this section, data is collected from two independent sources using two different acquisition methods. After retrieval and initial parsing, the datasets are aligned and combined into a single analysis-ready dataset that will be used in the subsequent cleaning and analysis steps.
1.1. Problem StatementΒΆ
1.2. Gathering at least two datasets using two different data gathering methodsΒΆ
List of data gathering methods:
- Download data manually
- Programmatically downloading files
- Gather data by accessing APIs
- Gather and extract data from HTML files using BeautifulSoup
- Extract data from a SQL database
Dataset 1 - WHO Alcohol - Related Disease Mortality (Programmatic Gathering)ΒΆ
Type: JSON data retrieved from an online API
Source: World Health Organization (WHO) - Global Health Observatory (GHO) API
Method: WHO's public API was used to programmatically query the dataset. The dataset was gathered programmatically by sending an HTTP GET request to the WHO Global Health Observatory API. First, the list of WHO indicators was queried to locate the corrrect indicator code. The code for Alcohol-Related disease mortality (per 100000 population) is:
- IndicatorCode: SA_0000001473
Then the dataset was retrieved from the API endpoint. https://ghoapi.azureedge.net/api/SA_0000001473. The JSON response was normalized into tabular form using pandas.jason_normalize(). Finally, the dataset was filtered to include only country-level observations, as regional or global aggregates were not relevant for the analysis.
import requests
import pandas as pd
# requests --> to call the API
# pandas --> to work with tables
BASE_URL = "https://ghoapi.azureedge.net/api"
# Qery the indicator list endpoint to locate the correct indicator code
indicators_url = f"{BASE_URL}/Indicator"
response = requests.get(indicators_url)
response.raise_for_status() # check error occurance
indicators_json = response.json()
len(indicators_json["value"])
3047
# Turn the JSON into a DataFrame
indicators_df = pd.json_normalize(indicators_json["value"])
indicators_df.head()
indicators_df.columns
Index(['IndicatorCode', 'IndicatorName', 'Language'], dtype='object')
# Explore Indicators by keyword
# search "life expectancy"
indicators_df[indicators_df["IndicatorName"].str.contains("life expectancy", case = False)]
| IndicatorCode | IndicatorName | Language | |
|---|---|---|---|
| 2001 | WHOSIS_000007 | Healthy life expectancy (HALE) at age 60 (years) | EN |
| 2012 | WHOSIS_000001 | Life expectancy at birth (years) | EN |
| 2024 | WHOSIS_000002 | Healthy life expectancy (HALE) at birth (years) | EN |
| 2045 | WHOSIS_000015 | Life expectancy at age 60 (years) | EN |
indicators_df[indicators_df["IndicatorName"].str.contains("mortality", case=False)]
| IndicatorCode | IndicatorName | Language | |
|---|---|---|---|
| 169 | MORTADO | Adolescent mortality rate (per 1 000 age speci... | EN |
| 248 | imr | Infant mortality rate (deaths per 1000 live bi... | EN |
| 332 | MDG_0000000007 | Under-five mortality rate (probability of dyin... | EN |
| 375 | MDG_0000000001 | Infant mortality rate (probability of dying be... | EN |
| 402 | MDG_0000000032 | Maternal mortality ratio (per 100 000 live bir... | EN |
| 457 | nmr | Neonatal mortality rate (deaths per 1000 live ... | EN |
| 578 | MDG_0000000026 | Maternal mortality ratio (per 100 000 live bir... | EN |
| 889 | SA_0000001473 | Alcohol-related disease mortality, per 100,000... | EN |
| 899 | GHE_YLLNUM | Years of life lost from mortality (YLLs) | EN |
| 900 | GHE_YLLRATE | Years of life lost from mortality (YLLs) (per ... | EN |
| 1105 | SA_0000001472 | Alcohol-related injury mortality, per 1,000 | EN |
| 1492 | WHS2_161 | Age-standardized mortality rate by cause (per ... | EN |
| 1524 | WHS2_131 | Age-standardized NCD mortality rate (per 100 ... | EN |
| 1526 | WHS2_160 | Age-standardized mortality rate by cause (per ... | EN |
| 1749 | SDGWSHBOD | Mortality rate attributed to exposure to unsaf... | EN |
| 1793 | u5mr | Under-five mortality rate (deaths per 1000 liv... | EN |
| 1862 | SDGPOISON | Mortality rate attributed to unintentional poi... | EN |
| 1863 | SDGROADAGE | Age-standardized road traffic mortality (per ... | EN |
| 1990 | WHOSIS_000003 | Neonatal mortality rate (per 1000 live births) | EN |
| 2013 | WHOSIS_000004 | Adult mortality rate (probability of dying bet... | EN |
| 2051 | WHS10_4 | Number of national population surveys - child ... | EN |
| 2077 | CHILDMORT5TO14 | Mortality rate for 5-14 year-olds (probability... | EN |
| 2321 | WHS10_5 | Number of national population surveys - matern... | EN |
| 2451 | GHE_YLL_NUMERIC | Years of life lost from mortality (YLLs) | EN |
| 2489 | NCD_CCS_MORT_TARGET | Existence of a national target on NCD mortality | EN |
| 2496 | PRISON_D3_DEATHS_DRUG_MRATE | In-prison drug overdose mortality rate (per 10... | EN |
| 2523 | WHOSIS_000016 | Mortality rate among children ages 5 to 9 year... | EN |
| 2649 | PRISON_D3_DEATHS_SUICIDE_MRATE | In-prison suicide mortality rate (per 100 000 ... | EN |
| 2704 | PRISON_D3_DEATHS_COVID_MRATE | In-prison COVID-19 mortality rate (per 100 000... | EN |
| 2708 | CHILDMORT10TO19 | Adolescent mortality rate (per 1 000 age speci... | EN |
| 2745 | CHILDMORT_MORTALITY_10TO14 | Mortality rate among children ages 10 to 14 ye... | EN |
| 2998 | MALARIA_EST_MORTALITY | Estimated malaria mortality rate (per 100 000 ... | EN |
# Request data for the chosen indicator
indicator_code = "SA_0000001473" # Alcohol related disease mortality, per 100000
indicator_url = f"{BASE_URL}/{indicator_code}"
# send a GET request
response = requests.get(indicator_url)
response.raise_for_status() # throws an error if request failed
indicator_json = response.json()
len(indicator_json["value"]) # will show how many rows (records) we got from the API
358
# Turn data into a DataFrame
# normalize the JSON response
who_raw_df = pd.json_normalize(indicator_json["value"])
who_raw_df.head()
| Id | IndicatorCode | SpatialDimType | SpatialDim | ParentLocationCode | TimeDimType | ParentLocation | Dim1Type | Dim1 | TimeDim | ... | DataSourceDim | Value | NumericValue | Low | High | Comments | Date | TimeDimensionValue | TimeDimensionBegin | TimeDimensionEnd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6284754 | SA_0000001473 | COUNTRY | SWE | EUR | YEAR | Europe | SEX | SEX_BTSX | 1997 | ... | None | 12.8 | 12.8 | None | None | None | 2013-06-11T14:15:34+02:00 | 1997 | 1997-01-01T00:00:00+01:00 | 1997-12-31T00:00:00+01:00 |
| 1 | 385071 | SA_0000001473 | COUNTRY | LTU | EUR | YEAR | Europe | SEX | SEX_MLE | 1990 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1990 | 1990-01-01T00:00:00+01:00 | 1990-12-31T00:00:00+01:00 |
| 2 | 1375206 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_MLE | 1991 | ... | None | 30.7 | 30.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 1991 | 1991-01-01T00:00:00+01:00 | 1991-12-31T00:00:00+01:00 |
| 3 | 2333151 | SA_0000001473 | COUNTRY | LTU | EUR | YEAR | Europe | SEX | SEX_FMLE | 1993 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1993 | 1993-01-01T00:00:00+01:00 | 1993-12-31T00:00:00+01:00 |
| 4 | 4299804 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_FMLE | 2003 | ... | None | 4.2 | 4.2 | None | None | None | 2013-06-11T14:15:34+02:00 | 2003 | 2003-01-01T00:00:00+01:00 | 2003-12-31T00:00:00+01:00 |
5 rows Γ 25 columns
# Filter the dataset to country-level observations
who_alcohol = who_raw_df[who_raw_df["SpatialDimType"] == "COUNTRY"].copy()
who_alcohol["SpatialDimType"].value_counts()
who_alcohol.head()
| Id | IndicatorCode | SpatialDimType | SpatialDim | ParentLocationCode | TimeDimType | ParentLocation | Dim1Type | Dim1 | TimeDim | ... | DataSourceDim | Value | NumericValue | Low | High | Comments | Date | TimeDimensionValue | TimeDimensionBegin | TimeDimensionEnd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6284754 | SA_0000001473 | COUNTRY | SWE | EUR | YEAR | Europe | SEX | SEX_BTSX | 1997 | ... | None | 12.8 | 12.8 | None | None | None | 2013-06-11T14:15:34+02:00 | 1997 | 1997-01-01T00:00:00+01:00 | 1997-12-31T00:00:00+01:00 |
| 1 | 385071 | SA_0000001473 | COUNTRY | LTU | EUR | YEAR | Europe | SEX | SEX_MLE | 1990 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1990 | 1990-01-01T00:00:00+01:00 | 1990-12-31T00:00:00+01:00 |
| 2 | 1375206 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_MLE | 1991 | ... | None | 30.7 | 30.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 1991 | 1991-01-01T00:00:00+01:00 | 1991-12-31T00:00:00+01:00 |
| 3 | 2333151 | SA_0000001473 | COUNTRY | LTU | EUR | YEAR | Europe | SEX | SEX_FMLE | 1993 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1993 | 1993-01-01T00:00:00+01:00 | 1993-12-31T00:00:00+01:00 |
| 4 | 4299804 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_FMLE | 2003 | ... | None | 4.2 | 4.2 | None | None | None | 2013-06-11T14:15:34+02:00 | 2003 | 2003-01-01T00:00:00+01:00 | 2003-12-31T00:00:00+01:00 |
5 rows Γ 25 columns
Dataset 2ΒΆ
Type: Excel file (.xlsx) manually downloaded from the United Nations website
Source: United Nations Department of Economic and Social Affairs (UN DESA)
Method: Manual download from the official UN World Population Prospects (WPP) Download Center
The dataset contains annual demographic indicators from 1950s to 2023 for all UN-recognized countries and regions. It includes population estimates, mortality, fertility, migration, and several demographic measures.
Dataset variables:
Population density
Male/Female population
Births, deaths, crude birth rate, crude death rate
Life expectancy indicators (at birth, age 15, age 65, age 80)
Fertility indicators
Mortality indicators (infant, under-five, age-specific)
Migration indicators
Why these two datasets were selected?ΒΆ
Alcohol-related disease mortality is fundamentally population-normalized indicator. Population size is necessary to interpret mortality risk meaningfully across countries and years. Without demographic context, raw mortality values or inconsistent indicators can lead to misleading comparisons. Therefore, combining WHO mortality data with UN population estimates allows standardized, comparable public health insights.
!pip install openpyxl
Defaulting to user installation because normal site-packages is not writeable
Collecting openpyxl
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
βββββββββββββββββββββββββββββββββββββββ 250.9/250.9 kB 5.2 MB/s eta 0:00:00a 0:00:01
Collecting et-xmlfile
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5
import site, sys
# KullanΔ±cΔ±ya ΓΆzel site-packages dizinini bul
user_site = site.getusersitepackages()
print("User site-packages:", user_site)
# Bu dizini sys.path'e ekle
if user_site not in sys.path:
sys.path.append(user_site)
# Εimdi openpyxl'Δ± import etmeyi dene
import openpyxl
print("openpyxl version:", openpyxl.__version__)
User site-packages: /home/student/.local/lib/python3.10/site-packages openpyxl version: 3.1.5
#FILL IN 2nd data gathering and loading method
# Load the gathered excel File
xls = pd.ExcelFile("WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx", engine="openpyxl")
xls.sheet_names
['Estimates', 'Medium variant', 'NOTES']
# 1) Sheet'i ham haliyle, baΕlΔ±ksΔ±z oku
raw = pd.read_excel(
"WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
sheet_name="Estimates",
header=None,
engine="openpyxl"
)
# Δ°lk 40 satΔ±rΔ± zaten gΓΆrdΓΌk, Εimdi header satΔ±rΔ±nΔ± otomatik bulalΔ±m:
header_row = raw.index[raw.iloc[:, 0] == "Index"][0]
header_row
16
pop_df = pd.read_excel(
"WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
sheet_name="Estimates",
header=header_row,
engine="openpyxl"
)
pop_df.head()
pop_df.columns
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
'Location code', 'ISO3 Alpha-code', 'ISO2 Alpha-code', 'SDMX code**',
'Type', 'Parent code', 'Year',
'Total Population, as of 1 January (thousands)',
'Total Population, as of 1 July (thousands)',
'Male Population, as of 1 July (thousands)',
'Female Population, as of 1 July (thousands)',
'Population Density, as of 1 July (persons per square km)',
'Population Sex Ratio, as of 1 July (males per 100 females)',
'Median Age, as of 1 July (years)',
'Natural Change, Births minus Deaths (thousands)',
'Rate of Natural Change (per 1,000 population)',
'Population Change (thousands)', 'Population Growth Rate (percentage)',
'Population Annual Doubling Time (years)', 'Births (thousands)',
'Births by women aged 15 to 19 (thousands)',
'Crude Birth Rate (births per 1,000 population)',
'Total Fertility Rate (live births per woman)',
'Net Reproduction Rate (surviving daughters per woman)',
'Mean Age Childbearing (years)',
'Sex Ratio at Birth (males per 100 female births)',
'Total Deaths (thousands)', 'Male Deaths (thousands)',
'Female Deaths (thousands)',
'Crude Death Rate (deaths per 1,000 population)',
'Life Expectancy at Birth, both sexes (years)',
'Male Life Expectancy at Birth (years)',
'Female Life Expectancy at Birth (years)',
'Life Expectancy at Age 15, both sexes (years)',
'Male Life Expectancy at Age 15 (years)',
'Female Life Expectancy at Age 15 (years)',
'Life Expectancy at Age 65, both sexes (years)',
'Male Life Expectancy at Age 65 (years)',
'Female Life Expectancy at Age 65 (years)',
'Life Expectancy at Age 80, both sexes (years)',
'Male Life Expectancy at Age 80 (years)',
'Female Life Expectancy at Age 80 (years)',
'Infant Deaths, under age 1 (thousands)',
'Infant Mortality Rate (infant deaths per 1,000 live births)',
'Live Births Surviving to Age 1 (thousands)',
'Under-Five Deaths, under age 5 (thousands)',
'Under-Five Mortality (deaths under age 5 per 1,000 live births)',
'Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births)',
'Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births)',
'Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births)',
'Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births)',
'Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)',
'Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)',
'Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)',
'Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)',
'Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)',
'Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)',
'Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)',
'Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)',
'Net Number of Migrants (thousands)',
'Net Migration Rate (per 1,000 population)'],
dtype='object')
Optional data storing step: You may save your raw dataset files to the local data store before moving to the next step.
pop_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 21983 entries, 0 to 21982 Data columns (total 65 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Index 21983 non-null int64 1 Variant 21983 non-null object 2 Region, subregion, country or area * 21983 non-null object 3 Notes 5628 non-null object 4 Location code 21983 non-null int64 5 ISO3 Alpha-code 17538 non-null object 6 ISO2 Alpha-code 17464 non-null object 7 SDMX code** 20942 non-null float64 8 Type 21983 non-null object 9 Parent code 21983 non-null int64 10 Year 21978 non-null float64 11 Total Population, as of 1 January (thousands) 21983 non-null object 12 Total Population, as of 1 July (thousands) 21983 non-null object 13 Male Population, as of 1 July (thousands) 21983 non-null object 14 Female Population, as of 1 July (thousands) 21983 non-null object 15 Population Density, as of 1 July (persons per square km) 21983 non-null object 16 Population Sex Ratio, as of 1 July (males per 100 females) 21983 non-null object 17 Median Age, as of 1 July (years) 21983 non-null object 18 Natural Change, Births minus Deaths (thousands) 21983 non-null object 19 Rate of Natural Change (per 1,000 population) 21983 non-null object 20 Population Change (thousands) 21983 non-null object 21 Population Growth Rate (percentage) 21983 non-null object 22 Population Annual Doubling Time (years) 21983 non-null object 23 Births (thousands) 21983 non-null object 24 Births by women aged 15 to 19 (thousands) 21983 non-null object 25 Crude Birth Rate (births per 1,000 population) 21983 non-null object 26 Total Fertility Rate (live births per woman) 21983 non-null object 27 Net Reproduction Rate (surviving daughters per woman) 21983 non-null object 28 Mean Age Childbearing (years) 21983 non-null object 29 Sex Ratio at Birth (males per 100 female births) 21983 non-null object 30 Total Deaths (thousands) 21983 non-null object 31 Male Deaths (thousands) 21983 non-null object 32 Female Deaths (thousands) 21983 non-null object 33 Crude Death Rate (deaths per 1,000 population) 21983 non-null object 34 Life Expectancy at Birth, both sexes (years) 21983 non-null object 35 Male Life Expectancy at Birth (years) 21983 non-null object 36 Female Life Expectancy at Birth (years) 21983 non-null object 37 Life Expectancy at Age 15, both sexes (years) 21983 non-null object 38 Male Life Expectancy at Age 15 (years) 21983 non-null object 39 Female Life Expectancy at Age 15 (years) 21983 non-null object 40 Life Expectancy at Age 65, both sexes (years) 21983 non-null object 41 Male Life Expectancy at Age 65 (years) 21983 non-null object 42 Female Life Expectancy at Age 65 (years) 21983 non-null object 43 Life Expectancy at Age 80, both sexes (years) 21983 non-null object 44 Male Life Expectancy at Age 80 (years) 21983 non-null object 45 Female Life Expectancy at Age 80 (years) 21983 non-null object 46 Infant Deaths, under age 1 (thousands) 21983 non-null object 47 Infant Mortality Rate (infant deaths per 1,000 live births) 21983 non-null object 48 Live Births Surviving to Age 1 (thousands) 21983 non-null object 49 Under-Five Deaths, under age 5 (thousands) 21983 non-null object 50 Under-Five Mortality (deaths under age 5 per 1,000 live births) 21983 non-null object 51 Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births) 21983 non-null object 52 Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births) 21983 non-null object 53 Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births) 21983 non-null object 54 Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births) 21983 non-null object 55 Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births) 21983 non-null object 56 Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births) 21983 non-null object 57 Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15) 21983 non-null object 58 Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15) 21983 non-null object 59 Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15) 21983 non-null object 60 Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15) 21983 non-null object 61 Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15) 21983 non-null object 62 Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15) 21983 non-null object 63 Net Number of Migrants (thousands) 21983 non-null object 64 Net Migration Rate (per 1,000 population) 21983 non-null object dtypes: float64(2), int64(3), object(60) memory usage: 10.9+ MB
2. Assess dataΒΆ
Dataset 1: WHO Alcohol-related disease mortalityΒΆ
Quality Issue 1:ΒΆ
# Inspecting the dataframe visually
who_alcohol.head(50)
| Id | IndicatorCode | SpatialDimType | SpatialDim | ParentLocationCode | TimeDimType | ParentLocation | Dim1Type | Dim1 | TimeDim | ... | DataSourceDim | Value | NumericValue | Low | High | Comments | Date | TimeDimensionValue | TimeDimensionBegin | TimeDimensionEnd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6284754 | SA_0000001473 | COUNTRY | SWE | EUR | YEAR | Europe | SEX | SEX_BTSX | 1997 | ... | None | 12.8 | 12.8 | None | None | None | 2013-06-11T14:15:34+02:00 | 1997 | 1997-01-01T00:00:00+01:00 | 1997-12-31T00:00:00+01:00 |
| 1 | 385071 | SA_0000001473 | COUNTRY | LTU | EUR | YEAR | Europe | SEX | SEX_MLE | 1990 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1990 | 1990-01-01T00:00:00+01:00 | 1990-12-31T00:00:00+01:00 |
| 2 | 1375206 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_MLE | 1991 | ... | None | 30.7 | 30.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 1991 | 1991-01-01T00:00:00+01:00 | 1991-12-31T00:00:00+01:00 |
| 3 | 2333151 | SA_0000001473 | COUNTRY | LTU | EUR | YEAR | Europe | SEX | SEX_FMLE | 1993 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1993 | 1993-01-01T00:00:00+01:00 | 1993-12-31T00:00:00+01:00 |
| 4 | 4299804 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_FMLE | 2003 | ... | None | 4.2 | 4.2 | None | None | None | 2013-06-11T14:15:34+02:00 | 2003 | 2003-01-01T00:00:00+01:00 | 2003-12-31T00:00:00+01:00 |
| 5 | 8555397 | SA_0000001473 | COUNTRY | MEX | AMR | YEAR | Americas | SEX | SEX_MLE | 1995 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1995 | 1995-01-01T00:00:00+01:00 | 1995-12-31T00:00:00+01:00 |
| 6 | 9810405 | SA_0000001473 | COUNTRY | MHL | WPR | YEAR | Western Pacific | SEX | SEX_BTSX | 2005 | ... | None | 33.9 | 33.9 | None | None | None | 2013-06-11T14:15:34+02:00 | 2005 | 2005-01-01T00:00:00+01:00 | 2005-12-31T00:00:00+01:00 |
| 7 | 1445164 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_BTSX | 2002 | ... | None | 10.9 | 10.9 | None | None | None | 2013-06-11T14:15:34+02:00 | 2002 | 2002-01-01T00:00:00+01:00 | 2002-12-31T00:00:00+01:00 |
| 8 | 6767273 | SA_0000001473 | COUNTRY | MEX | AMR | YEAR | Americas | SEX | SEX_BTSX | 2001 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2001 | 2001-01-01T00:00:00+01:00 | 2001-12-31T00:00:00+01:00 |
| 9 | 3380008 | SA_0000001473 | COUNTRY | ISL | EUR | YEAR | Europe | SEX | SEX_MLE | 2000 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2000 | 2000-01-01T00:00:00+01:00 | 2000-12-31T00:00:00+01:00 |
| 10 | 1305540 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_MLE | 1993 | ... | None | 32.1 | 32.1 | None | None | None | 2013-06-11T14:15:34+02:00 | 1993 | 1993-01-01T00:00:00+01:00 | 1993-12-31T00:00:00+01:00 |
| 11 | 8004200 | SA_0000001473 | COUNTRY | DNK | EUR | YEAR | Europe | SEX | SEX_BTSX | 1994 | ... | None | 26.9 | 26.9 | None | None | None | 2013-06-11T14:15:34+02:00 | 1994 | 1994-01-01T00:00:00+01:00 | 1994-12-31T00:00:00+01:00 |
| 12 | 2644247 | SA_0000001473 | COUNTRY | CZE | EUR | YEAR | Europe | SEX | SEX_FMLE | 2002 | ... | None | 14.0 | 14.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2002 | 2002-01-01T00:00:00+01:00 | 2002-12-31T00:00:00+01:00 |
| 13 | 2542690 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_MLE | 2003 | ... | None | 31.6 | 31.6 | None | None | None | 2013-06-11T14:15:34+02:00 | 2003 | 2003-01-01T00:00:00+01:00 | 2003-12-31T00:00:00+01:00 |
| 14 | 5124744 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_FMLE | 1996 | ... | None | 10.4 | 10.4 | None | None | None | 2013-06-11T14:15:34+02:00 | 1996 | 1996-01-01T00:00:00+01:00 | 1996-12-31T00:00:00+01:00 |
| 15 | 195777 | SA_0000001473 | COUNTRY | LVA | EUR | YEAR | Europe | SEX | SEX_FMLE | 2005 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2005 | 2005-01-01T00:00:00+01:00 | 2005-12-31T00:00:00+01:00 |
| 16 | 7617973 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_BTSX | 2006 | ... | None | 9.0 | 9.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2006 | 2006-01-01T00:00:00+01:00 | 2006-12-31T00:00:00+01:00 |
| 17 | 2720479 | SA_0000001473 | COUNTRY | ISL | EUR | YEAR | Europe | SEX | SEX_MLE | 2001 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2001 | 2001-01-01T00:00:00+01:00 | 2001-12-31T00:00:00+01:00 |
| 18 | 1496217 | SA_0000001473 | COUNTRY | HUN | EUR | YEAR | Europe | SEX | SEX_FMLE | 2002 | ... | None | 50.0 | 50.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2002 | 2002-01-01T00:00:00+01:00 | 2002-12-31T00:00:00+01:00 |
| 19 | 7402119 | SA_0000001473 | COUNTRY | RUS | EUR | YEAR | Europe | SEX | SEX_MLE | 1994 | ... | None | 863.0 | 863.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 1994 | 1994-01-01T00:00:00+01:00 | 1994-12-31T00:00:00+01:00 |
| 20 | 5505314 | SA_0000001473 | COUNTRY | ISL | EUR | YEAR | Europe | SEX | SEX_BTSX | 1997 | ... | None | 1.9 | 1.9 | None | None | None | 2013-06-11T14:15:34+02:00 | 1997 | 1997-01-01T00:00:00+01:00 | 1997-12-31T00:00:00+01:00 |
| 21 | 2276423 | SA_0000001473 | COUNTRY | ISL | EUR | YEAR | Europe | SEX | SEX_BTSX | 1998 | ... | None | 4.3 | 4.3 | None | None | None | 2013-06-11T14:15:34+02:00 | 1998 | 1998-01-01T00:00:00+01:00 | 1998-12-31T00:00:00+01:00 |
| 22 | 4084969 | SA_0000001473 | COUNTRY | UKR | EUR | YEAR | Europe | SEX | SEX_FMLE | 2004 | ... | None | 72.2 | 72.2 | None | None | None | 2013-06-11T14:15:34+02:00 | 2004 | 2004-01-01T00:00:00+01:00 | 2004-12-31T00:00:00+01:00 |
| 23 | 4034217 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_BTSX | 2004 | ... | None | 19.8 | 19.8 | None | None | None | 2013-06-11T14:15:34+02:00 | 2004 | 2004-01-01T00:00:00+01:00 | 2004-12-31T00:00:00+01:00 |
| 24 | 8929210 | SA_0000001473 | COUNTRY | RUS | EUR | YEAR | Europe | SEX | SEX_FMLE | 2002 | ... | None | 47.0 | 47.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2002 | 2002-01-01T00:00:00+01:00 | 2002-12-31T00:00:00+01:00 |
| 25 | 2390944 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_MLE | 2006 | ... | None | 28.7 | 28.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 2006 | 2006-01-01T00:00:00+01:00 | 2006-12-31T00:00:00+01:00 |
| 26 | 6904596 | SA_0000001473 | COUNTRY | FIN | EUR | YEAR | Europe | SEX | SEX_FMLE | 1998 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1998 | 1998-01-01T00:00:00+01:00 | 1998-12-31T00:00:00+01:00 |
| 27 | 7267840 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_FMLE | 1998 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1998 | 1998-01-01T00:00:00+01:00 | 1998-12-31T00:00:00+01:00 |
| 28 | 3647096 | SA_0000001473 | COUNTRY | NZL | WPR | YEAR | Western Pacific | SEX | SEX_FMLE | 2007 | ... | None | 1.5 | 1.5 | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 29 | 328689 | SA_0000001473 | COUNTRY | DNK | EUR | YEAR | Europe | SEX | SEX_MLE | 1994 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1994 | 1994-01-01T00:00:00+01:00 | 1994-12-31T00:00:00+01:00 |
| 30 | 8049111 | SA_0000001473 | COUNTRY | SWE | EUR | YEAR | Europe | SEX | SEX_BTSX | 2007 | ... | None | 10.9 | 10.9 | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 31 | 1566065 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_MLE | 1994 | ... | None | 33.3 | 33.3 | None | None | None | 2013-06-11T14:15:34+02:00 | 1994 | 1994-01-01T00:00:00+01:00 | 1994-12-31T00:00:00+01:00 |
| 32 | 7535722 | SA_0000001473 | COUNTRY | SWE | EUR | YEAR | Europe | SEX | SEX_FMLE | 1998 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1998 | 1998-01-01T00:00:00+01:00 | 1998-12-31T00:00:00+01:00 |
| 33 | 5283617 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_BTSX | 2003 | ... | None | 20.6 | 20.6 | None | None | None | 2013-06-11T14:15:34+02:00 | 2003 | 2003-01-01T00:00:00+01:00 | 2003-12-31T00:00:00+01:00 |
| 34 | 4281335 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_BTSX | 2002 | ... | None | 20.7 | 20.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 2002 | 2002-01-01T00:00:00+01:00 | 2002-12-31T00:00:00+01:00 |
| 35 | 10020081 | SA_0000001473 | COUNTRY | LVA | EUR | YEAR | Europe | SEX | SEX_MLE | 1993 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1993 | 1993-01-01T00:00:00+01:00 | 1993-12-31T00:00:00+01:00 |
| 36 | 4319503 | SA_0000001473 | COUNTRY | ISL | EUR | YEAR | Europe | SEX | SEX_BTSX | 2001 | ... | None | 3.2 | 3.2 | None | None | None | 2013-06-11T14:15:34+02:00 | 2001 | 2001-01-01T00:00:00+01:00 | 2001-12-31T00:00:00+01:00 |
| 37 | 601695 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_BTSX | 1991 | ... | None | 19.7 | 19.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 1991 | 1991-01-01T00:00:00+01:00 | 1991-12-31T00:00:00+01:00 |
| 38 | 6628496 | SA_0000001473 | COUNTRY | SVN | EUR | YEAR | Europe | SEX | SEX_MLE | 2007 | ... | None | 118.2 | 118.2 | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 39 | 1103300 | SA_0000001473 | COUNTRY | LVA | EUR | YEAR | Europe | SEX | SEX_FMLE | 2007 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 40 | 7067644 | SA_0000001473 | COUNTRY | GBR | EUR | YEAR | Europe | SEX | SEX_MLE | 2008 | ... | None | 18.7 | 18.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 2008 | 2008-01-01T00:00:00+01:00 | 2008-12-31T00:00:00+01:00 |
| 41 | 1654880 | SA_0000001473 | COUNTRY | MHL | WPR | YEAR | Western Pacific | SEX | SEX_BTSX | 2007 | ... | None | 29.0 | 29.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 42 | 5784852 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_BTSX | 1998 | ... | None | 11.6 | 11.6 | None | None | None | 2013-06-11T14:15:34+02:00 | 1998 | 1998-01-01T00:00:00+01:00 | 1998-12-31T00:00:00+01:00 |
| 43 | 9119889 | SA_0000001473 | COUNTRY | HUN | EUR | YEAR | Europe | SEX | SEX_BTSX | 1993 | ... | None | 67.2 | 67.2 | None | None | None | 2013-06-11T14:15:34+02:00 | 1993 | 1993-01-01T00:00:00+01:00 | 1993-12-31T00:00:00+01:00 |
| 44 | 6654025 | SA_0000001473 | COUNTRY | LVA | EUR | YEAR | Europe | SEX | SEX_BTSX | 2007 | ... | None | 19.0 | 19.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 45 | 8429412 | SA_0000001473 | COUNTRY | SWE | EUR | YEAR | Europe | SEX | SEX_FMLE | 1999 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1999 | 1999-01-01T00:00:00+01:00 | 1999-12-31T00:00:00+01:00 |
| 46 | 1274636 | SA_0000001473 | COUNTRY | MHL | WPR | YEAR | Western Pacific | SEX | SEX_FMLE | 2007 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 47 | 4966910 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_FMLE | 2000 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2000 | 2000-01-01T00:00:00+01:00 | 2000-12-31T00:00:00+01:00 |
| 48 | 3952159 | SA_0000001473 | COUNTRY | LVA | EUR | YEAR | Europe | SEX | SEX_MLE | 2007 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2007 | 2007-01-01T00:00:00+01:00 | 2007-12-31T00:00:00+01:00 |
| 49 | 5778612 | SA_0000001473 | COUNTRY | DEU | EUR | YEAR | Europe | SEX | SEX_BTSX | 1997 | ... | None | 21.7 | 21.7 | None | None | None | 2013-06-11T14:15:34+02:00 | 1997 | 1997-01-01T00:00:00+01:00 | 1997-12-31T00:00:00+01:00 |
50 rows Γ 25 columns
who_alcohol.sample(5)
| Id | IndicatorCode | SpatialDimType | SpatialDim | ParentLocationCode | TimeDimType | ParentLocation | Dim1Type | Dim1 | TimeDim | ... | DataSourceDim | Value | NumericValue | Low | High | Comments | Date | TimeDimensionValue | TimeDimensionBegin | TimeDimensionEnd | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 339 | 6093941 | SA_0000001473 | COUNTRY | LVA | EUR | YEAR | Europe | SEX | SEX_BTSX | 2009 | ... | None | 19.0 | 19.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2009 | 2009-01-01T00:00:00+01:00 | 2009-12-31T00:00:00+01:00 |
| 255 | 9161552 | SA_0000001473 | COUNTRY | FIN | EUR | YEAR | Europe | SEX | SEX_FMLE | 2009 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 2009 | 2009-01-01T00:00:00+01:00 | 2009-12-31T00:00:00+01:00 |
| 174 | 5361902 | SA_0000001473 | COUNTRY | NOR | EUR | YEAR | Europe | SEX | SEX_FMLE | 1999 | ... | None | . | NaN | None | None | None | 2013-06-11T14:15:34+02:00 | 1999 | 1999-01-01T00:00:00+01:00 | 1999-12-31T00:00:00+01:00 |
| 297 | 5909478 | SA_0000001473 | COUNTRY | CZE | EUR | YEAR | Europe | SEX | SEX_MLE | 2002 | ... | None | 90.0 | 90.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2002 | 2002-01-01T00:00:00+01:00 | 2002-12-31T00:00:00+01:00 |
| 337 | 1247771 | SA_0000001473 | COUNTRY | SWE | EUR | YEAR | Europe | SEX | SEX_MLE | 2002 | ... | None | 27.0 | 27.0 | None | None | None | 2013-06-11T14:15:34+02:00 | 2002 | 2002-01-01T00:00:00+01:00 | 2002-12-31T00:00:00+01:00 |
5 rows Γ 25 columns
who_alcohol.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 358 entries, 0 to 357 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 358 non-null int64 1 IndicatorCode 358 non-null object 2 SpatialDimType 358 non-null object 3 SpatialDim 358 non-null object 4 ParentLocationCode 358 non-null object 5 TimeDimType 358 non-null object 6 ParentLocation 358 non-null object 7 Dim1Type 358 non-null object 8 Dim1 358 non-null object 9 TimeDim 358 non-null int64 10 Dim2Type 0 non-null object 11 Dim2 0 non-null object 12 Dim3Type 0 non-null object 13 Dim3 0 non-null object 14 DataSourceDimType 0 non-null object 15 DataSourceDim 0 non-null object 16 Value 358 non-null object 17 NumericValue 209 non-null float64 18 Low 0 non-null object 19 High 0 non-null object 20 Comments 0 non-null object 21 Date 358 non-null object 22 TimeDimensionValue 358 non-null object 23 TimeDimensionBegin 358 non-null object 24 TimeDimensionEnd 358 non-null object dtypes: float64(1), int64(2), object(22) memory usage: 70.0+ KB
Missing mortality values (NumericValue has many NaNs)ΒΆ
When I inspect the dataset visually NumericValue column has so many NaN values.
# Inspecting the dataframe programmatically
who_alcohol["NumericValue"].isna().sum()
149
There are 149 missing values. Therefore, approximately 41% of the total 358 rows is missing.
Issue and justification: This is a quality problem since missing values prevent conducting calculations and it corrupts the comparisons between countries during EDA. Lastly, it may led missing rows after merging.
Quality Issue 2:ΒΆ
Value column includes ".", this actually should be numeric.
# Inspecting the dataframe programmatically
who_alcohol["Value"].unique()[:10]
array(['12.8', '.', '30.7', '4.2', '33.9', '10.9', '32.1', '26.9', '14.0',
'31.6'], dtype=object)
Issue and justification: Value region is in string format and "." identificator means missing in WHO API.
This is a quality problem since:
- We can not conduct numerical operations directly on strings
- "." values creates error when we try to convert them to
float - Inconsistent with
NumericValue
Tidiness Issue 1:ΒΆ
FILL IN - Inspecting the dataframe visuallyΒΆ
Sex (Dim1) is a variable stored as a value, not a separate column. When we complete the visual assessment it can be seen that inside the Dim1 column there are categories named "SEX_MLE", "SEX_FMLE", "SEX_BTSX" respectively. This actually another variable (sex), but it is held as a categorical value in a single column.
# Inspecting the dataframe programmatically
who_alcohol["Dim1"].value_counts()
Dim1 SEX_FMLE 121 SEX_BTSX 119 SEX_MLE 118 Name: count, dtype: int64
Issue and justification:
In order to have a tidy dataset, every row should be an observation and every column should be a value.
Tidiness Issue 2:ΒΆ
FILL IN - Inspecting the dataframe visuallyΒΆ
There are columns with 0 useful information (Dim2, Dim3, Low, High, Comments).
# Inspecting the dataframe programmatically
who_alcohol[["Dim2Type", "Dim2", "Dim3Type","Dim3", "Low", "High", "Comments"]].info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 358 entries, 0 to 357 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Dim2Type 0 non-null object 1 Dim2 0 non-null object 2 Dim3Type 0 non-null object 3 Dim3 0 non-null object 4 Low 0 non-null object 5 High 0 non-null object 6 Comments 0 non-null object dtypes: object(7) memory usage: 19.7+ KB
who_alcohol.isna().sum().sort_values()
Id 0 TimeDimensionValue 0 Date 0 Value 0 TimeDimensionBegin 0 TimeDim 0 Dim1 0 TimeDimensionEnd 0 ParentLocation 0 IndicatorCode 0 SpatialDimType 0 Dim1Type 0 SpatialDim 0 TimeDimType 0 ParentLocationCode 0 NumericValue 149 Dim3 358 DataSourceDimType 358 DataSourceDim 358 Dim2 358 Low 358 High 358 Comments 358 Dim2Type 358 Dim3Type 358 dtype: int64
Issue and justification: Several columns contain 0 non-null values and provide no usable information. These variables do not contribute any meaningful information to the dataset. They violate the "tidy data" principle because they add unneccessary structural noise and do not represent actual observations or variables.
Keeping them would increase dataset complexity, create clutter during analysis and reduce interpretability.
Dataset 2: UN Population DatasetΒΆ
Quality Issue 1:ΒΆ
# Inspecting the dataframe visually
# Initial Inspection
pop_df.head(20)
| Index | Variant | Region, subregion, country or area * | Notes | Location code | ISO3 Alpha-code | ISO2 Alpha-code | SDMX code** | Type | Parent code | ... | Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births) | Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births) | Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15) | Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15) | Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15) | Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15) | Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15) | Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15) | Net Number of Migrants (thousands) | Net Migration Rate (per 1,000 population) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 580.5 | 497.388 | 238.516 | 268.734 | 207.62 | 375.391 | 426.221 | 322.65 | 0 | 0 |
| 1 | 2 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 566.566 | 488.435 | 229.703 | 256.236 | 202.734 | 365.226 | 412.76 | 316.395 | 0 | 0 |
| 2 | 3 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 546.444 | 475.37 | 217.311 | 238.56 | 195.926 | 350.613 | 393.364 | 307.314 | 0 | 0 |
| 3 | 4 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 535.811 | 467.361 | 211.257 | 230.961 | 191.482 | 342.734 | 383.875 | 301.27 | 0 | 0 |
| 4 | 5 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 522.058 | 455.621 | 203.337 | 221.377 | 185.296 | 332.327 | 371.737 | 292.807 | 0 | 0 |
| 5 | 6 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 514.72 | 447.104 | 199.594 | 217.779 | 181.41 | 327.447 | 367.308 | 287.446 | 0 | 0 |
| 6 | 7 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 507.214 | 439.595 | 196.203 | 214.202 | 178.21 | 323.096 | 362.805 | 283.257 | 0 | 0 |
| 7 | 8 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 503.041 | 435.133 | 194.296 | 212.197 | 176.408 | 320.764 | 360.581 | 280.835 | 0 | 0 |
| 8 | 9 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 495.575 | 427.387 | 190.913 | 208.967 | 172.851 | 315.308 | 355.044 | 275.449 | 0 | 0 |
| 9 | 10 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 525.642 | 455.314 | 202.942 | 223.707 | 182.052 | 332.92 | 375.396 | 290.021 | 0 | 0 |
| 10 | 11 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 551.721 | 480.332 | 214.608 | 237.522 | 191.453 | 349.468 | 393.991 | 304.225 | 0 | 0 |
| 11 | 12 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 516.3 | 445.337 | 199.138 | 220.017 | 178.094 | 327.37 | 369.846 | 284.485 | 0 | 0 |
| 12 | 13 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 473.158 | 403.755 | 179.709 | 197.386 | 161.984 | 300.651 | 340.388 | 260.936 | 0 | 0 |
| 13 | 14 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 466.352 | 397.364 | 176.593 | 193.945 | 159.168 | 296.424 | 335.714 | 257.204 | 0 | 0 |
| 14 | 15 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 459.158 | 389.924 | 173.07 | 190.414 | 155.613 | 291.361 | 330.474 | 252.34 | 0 | 0 |
| 15 | 16 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 465.949 | 391.46 | 176.325 | 196.739 | 155.561 | 295.373 | 337.598 | 252.861 | 0 | 0 |
| 16 | 17 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 455.293 | 384.766 | 170.328 | 188.421 | 152.005 | 288.002 | 327.742 | 248.281 | 0 | 0 |
| 17 | 18 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 447.962 | 376.879 | 165.537 | 184.02 | 146.783 | 282.174 | 322.109 | 242.23 | 0 | 0 |
| 18 | 19 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 439.806 | 368.526 | 161.587 | 180.165 | 142.707 | 277.451 | 317.265 | 237.608 | 0 | 0 |
| 19 | 20 | Estimates | World | NaN | 900 | NaN | NaN | 1.0 | World | 0 | ... | 434.278 | 363.088 | 158.623 | 177.186 | 139.726 | 274.002 | 313.646 | 234.306 | 0 | 0 |
20 rows Γ 65 columns
In the first several rows of the dataset:
Region, subregion, country or area * = "World"
ISO3 Alpha-code = NaN
Type = "World"
The dataset contains non-country summary rows.
# Inspecting the dataframe programmatically
pop_df["Type"].value_counts().head(20)
Type Country/Area 17538 Subregion 1554 SDG region 666 Special other 666 Income Group 666 Region 444 Development Group 370 World 74 Label/Separator 5 Name: count, dtype: int64
pop_df["ISO3 Alpha-code"].isna().sum()
4445
Returns a large number of missing ISO codes - these correspond to non-country entries. This is a data quality problem since:
- WHO data contains only actual countries
- Rows like "World", "Africa", "Europe" can not be merged
Quality Issue 2:ΒΆ
Inspecting the dataframe visuallyΒΆ
Notes column has unusable or missing values. This indicates a column that was imported but contains no meaningful data.
# Inspecting the dataframe programmatically
pop_df["Notes"].isna().sum()
16355
Tidiness Issue 1:ΒΆ
Inspecting the dataframe visuallyΒΆ
Country, region, and world data stored together and this is a tidiness issue. It breaks a core rule of tidy data:
Each table should contain only one type of observational unit.
- Rows are World-level, not country level.
- "Type" column = World, Region, Subregion, Country
# Inspecting the dataframe programmatically
pop_df["Type"].value_counts()
Type Country/Area 17538 Subregion 1554 SDG region 666 Special other 666 Income Group 666 Region 444 Development Group 370 World 74 Label/Separator 5 Name: count, dtype: int64
The above output points to the tidiness issue since:
- These represent different entries not one
- Must be filtered so the dataset contains only:
Type == "Country"before merging with WHO dataset.
Tidiness Issue 2:ΒΆ
Inspecting the dataframe visuallyΒΆ
Long, descriptive column names like Male Mortality before Age 60 include full descriptions and definitions. Tidy data requires column names be machine-friendly not long sentences. Long names complicate indexing, cleaning, plotting and merging.
# Inspecting the dataframe programmatically
pop_df.columns
Index(['Index', 'Variant', 'Region, subregion, country or area *', 'Notes',
'Location code', 'ISO3 Alpha-code', 'ISO2 Alpha-code', 'SDMX code**',
'Type', 'Parent code', 'Year',
'Total Population, as of 1 January (thousands)',
'Total Population, as of 1 July (thousands)',
'Male Population, as of 1 July (thousands)',
'Female Population, as of 1 July (thousands)',
'Population Density, as of 1 July (persons per square km)',
'Population Sex Ratio, as of 1 July (males per 100 females)',
'Median Age, as of 1 July (years)',
'Natural Change, Births minus Deaths (thousands)',
'Rate of Natural Change (per 1,000 population)',
'Population Change (thousands)', 'Population Growth Rate (percentage)',
'Population Annual Doubling Time (years)', 'Births (thousands)',
'Births by women aged 15 to 19 (thousands)',
'Crude Birth Rate (births per 1,000 population)',
'Total Fertility Rate (live births per woman)',
'Net Reproduction Rate (surviving daughters per woman)',
'Mean Age Childbearing (years)',
'Sex Ratio at Birth (males per 100 female births)',
'Total Deaths (thousands)', 'Male Deaths (thousands)',
'Female Deaths (thousands)',
'Crude Death Rate (deaths per 1,000 population)',
'Life Expectancy at Birth, both sexes (years)',
'Male Life Expectancy at Birth (years)',
'Female Life Expectancy at Birth (years)',
'Life Expectancy at Age 15, both sexes (years)',
'Male Life Expectancy at Age 15 (years)',
'Female Life Expectancy at Age 15 (years)',
'Life Expectancy at Age 65, both sexes (years)',
'Male Life Expectancy at Age 65 (years)',
'Female Life Expectancy at Age 65 (years)',
'Life Expectancy at Age 80, both sexes (years)',
'Male Life Expectancy at Age 80 (years)',
'Female Life Expectancy at Age 80 (years)',
'Infant Deaths, under age 1 (thousands)',
'Infant Mortality Rate (infant deaths per 1,000 live births)',
'Live Births Surviving to Age 1 (thousands)',
'Under-Five Deaths, under age 5 (thousands)',
'Under-Five Mortality (deaths under age 5 per 1,000 live births)',
'Mortality before Age 40, both sexes (deaths under age 40 per 1,000 live births)',
'Male Mortality before Age 40 (deaths under age 40 per 1,000 male live births)',
'Female Mortality before Age 40 (deaths under age 40 per 1,000 female live births)',
'Mortality before Age 60, both sexes (deaths under age 60 per 1,000 live births)',
'Male Mortality before Age 60 (deaths under age 60 per 1,000 male live births)',
'Female Mortality before Age 60 (deaths under age 60 per 1,000 female live births)',
'Mortality between Age 15 and 50, both sexes (deaths under age 50 per 1,000 alive at age 15)',
'Male Mortality between Age 15 and 50 (deaths under age 50 per 1,000 males alive at age 15)',
'Female Mortality between Age 15 and 50 (deaths under age 50 per 1,000 females alive at age 15)',
'Mortality between Age 15 and 60, both sexes (deaths under age 60 per 1,000 alive at age 15)',
'Male Mortality between Age 15 and 60 (deaths under age 60 per 1,000 males alive at age 15)',
'Female Mortality between Age 15 and 60 (deaths under age 60 per 1,000 females alive at age 15)',
'Net Number of Migrants (thousands)',
'Net Migration Rate (per 1,000 population)'],
dtype='object')
3. Clean dataΒΆ
Dataset 1: WHO Alcohol-related disease mortalityΒΆ
**Quality Issue 1: **ΒΆ
# Apply the cleaning strategy
"""
Fix "."
Issue:
- `Value` column contains "."
"""
# Convert Value to numeric, turning "." into NaN
who_alcohol["Value_clean"] = pd.to_numeric(who_alcohol["Value"], errors = "coerce")
**Quality Issue 2: **ΒΆ
# Apply the cleaning strategy
"""
Fix missing mortality values
Issue:
- `NumericValue` contains NaN for many rows
"""
# Fill NumericValue using Value_clean
who_alcohol["Mortality"] = who_alcohol["NumericValue"].fillna(who_alcohol["Value_clean"])
# Drop original columns
who_alcohol.drop(columns = ["NumericValue", "Value", "Value_clean"], inplace = True)
# Validate the cleaning was successful
who_alcohol["Mortality"].isna().sum()
149
Justification: During the cleaning process, I attempted to recover missing mortality values by converting the string-based Value column into a numeric format and using it to fill gaps in NumericValue. However, programmatic assessment showed that all rows with missing NumericValue also had Value = ".", meaning no additional numeric information existed to recover these observations. Because the missingness originates from the WHO source itself and not from a wrangling issue, I decided to retain these rows instead of removing them. Keeping them preserves the original dataset structure.
Tidiness Issue 1: Remove useless empty columnsΒΆ
# Apply the cleaning strategy
empty_cols = ["Dim2Type", "Dim2", "Dim3Type", "Dim3", "Low", "High", "Comments", "DataSourceDimType","DataSourceDim"]
who_alcohol.drop(columns = empty_cols, inplace = True)
# Validate the cleaning was successful
who_alcohol.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 358 entries, 0 to 357 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Id 358 non-null int64 1 IndicatorCode 358 non-null object 2 SpatialDimType 358 non-null object 3 SpatialDim 358 non-null object 4 ParentLocationCode 358 non-null object 5 TimeDimType 358 non-null object 6 ParentLocation 358 non-null object 7 Dim1Type 358 non-null object 8 Dim1 358 non-null object 9 TimeDim 358 non-null int64 10 Date 358 non-null object 11 TimeDimensionValue 358 non-null object 12 TimeDimensionBegin 358 non-null object 13 TimeDimensionEnd 358 non-null object 14 Mortality 209 non-null float64 dtypes: float64(1), int64(2), object(12) memory usage: 42.1+ KB
Justification: Columns "Dim2Type", "Dim2", "Dim3Type", "Dim3", "Low", "High", "Comments", "DataSourceDimType","DataSourceDim" contain 0 nΔ±n-null values. Therefore, empty columns were removed.
Tidiness Issue 2: Tidy sex variableΒΆ
# Apply the cleaning strategy
who_alcohol.rename(columns={"Dim1": "Sex"}, inplace=True)
who_alcohol = who_alcohol[who_alcohol["Sex"] == "SEX_BTSX"].copy()
# Validate the cleaning was successful
who_alcohol["Sex"].value_counts()
Sex SEX_BTSX 119 Name: count, dtype: int64
who_alcohol = who_alcohol[["SpatialDim", "ParentLocation", "TimeDim", "Mortality"]].copy()
who_alcohol.rename(columns={
"SpatialDim": "CountryCode",
"ParentLocation": "Country",
"TimeDim": "Year"
}, inplace=True)
who_alcohol.head()
| CountryCode | Country | Year | Mortality | |
|---|---|---|---|---|
| 0 | SWE | Europe | 1997 | 12.8 |
| 6 | MHL | Western Pacific | 2005 | 33.9 |
| 7 | NOR | Europe | 2002 | 10.9 |
| 8 | MEX | Americas | 2001 | NaN |
| 11 | DNK | Europe | 1994 | 26.9 |
Justification: Both sexes combined represents a standardized public health indicator and provides the most comparable country-year mortality values. Male-only and female-only categories were removed to maintain one observation per country-year, which is required for tidy data.
Dataset 2: UN Population DatasetΒΆ
Quality Issue 1: Keep only country-level rowsΒΆ
# Keep only country - level rows
# Keep only rows with ISO3 codes
pop_clean = pop_df[pop_df["ISO3 Alpha-code"].notna()].copy()
# Validation programmatically
pop_clean["Type"].value_counts()
Type Country/Area 17538 Name: count, dtype: int64
Quality Issue 2: Convert population column to numericΒΆ
# Convert population column to numeric
pop_clean["TotalPopulation_thousands"] = pd.to_numeric(
pop_clean["Total Population, as of 1 July (thousands)"], errors="coerce"
)
pop_clean["TotalPopulation"] = pop_clean["TotalPopulation_thousands"] * 1000
pop_clean["TotalPopulation"].describe()
count 1.753800e+04 mean 2.153491e+07 std 9.574493e+07 min 4.960000e+02 25% 2.204615e+05 50% 3.135058e+06 75% 1.089016e+07 max 1.438070e+09 Name: TotalPopulation, dtype: float64
pop_clean = pop_clean[[
"ISO3 Alpha-code",
"Region, subregion, country or area *",
"Year",
"TotalPopulation"
]].copy()
pop_clean.rename(columns={
"ISO3 Alpha-code": "CountryCode",
"Region, subregion, country or area *": "Country"
}, inplace=True)
pop_clean.head()
| CountryCode | Country | Year | TotalPopulation | |
|---|---|---|---|---|
| 2594 | BDI | Burundi | 1950.0 | 2254938.0 |
| 2595 | BDI | Burundi | 1951.0 | 2305746.0 |
| 2596 | BDI | Burundi | 1952.0 | 2355804.0 |
| 2597 | BDI | Burundi | 1953.0 | 2405186.0 |
| 2598 | BDI | Burundi | 1954.0 | 2454586.0 |
Remove unnecessary variables and combine datasetsΒΆ
Depending on the datasets, you can also peform the combination before the cleaning steps.
# Remove unnecessary variables and combine datasets
who_alcohol.head()
| CountryCode | Country | Year | Mortality | |
|---|---|---|---|---|
| 0 | SWE | Europe | 1997 | 12.8 |
| 6 | MHL | Western Pacific | 2005 | 33.9 |
| 7 | NOR | Europe | 2002 | 10.9 |
| 8 | MEX | Americas | 2001 | NaN |
| 11 | DNK | Europe | 1994 | 26.9 |
who_alcohol.info()
<class 'pandas.core.frame.DataFrame'> Index: 119 entries, 0 to 354 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CountryCode 119 non-null object 1 Country 119 non-null object 2 Year 119 non-null int64 3 Mortality 102 non-null float64 dtypes: float64(1), int64(1), object(2) memory usage: 4.6+ KB
merged = who_alcohol.merge(pop_clean,
on=["CountryCode", "Year"],
how="inner")
merged.head()
merged.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 119 entries, 0 to 118 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CountryCode 119 non-null object 1 Country_x 119 non-null object 2 Year 119 non-null int64 3 Mortality 102 non-null float64 4 Country_y 119 non-null object 5 TotalPopulation 119 non-null float64 dtypes: float64(2), int64(1), object(3) memory usage: 5.7+ KB
# Check if Country_x and Country_y identical
(merged["Country_x"] == merged["Country_y"]).all()
False
Some country names differ between WHO and UN datasets.
# Only see the mismatched rows
merged[merged["Country_x"] != merged["Country_y"]][["CountryCode", "Country_x", "Country_y"]]
| CountryCode | Country_x | Country_y | |
|---|---|---|---|
| 0 | SWE | Europe | Sweden |
| 1 | MHL | Western Pacific | Marshall Islands |
| 2 | NOR | Europe | Norway |
| 3 | MEX | Americas | Mexico |
| 4 | DNK | Europe | Denmark |
| ... | ... | ... | ... |
| 114 | TUR | Europe | TΓΌrkiye |
| 115 | LTU | Europe | Lithuania |
| 116 | DEU | Europe | Germany |
| 117 | MHL | Western Pacific | Marshall Islands |
| 118 | DEU | Europe | Germany |
119 rows Γ 3 columns
merged.rename(columns={"Country_x": "Region"}, inplace=True)
merged.rename(columns={"Country_y": "Country"}, inplace=True)
merged.head()
| CountryCode | Region | Year | Mortality | Country | TotalPopulation | |
|---|---|---|---|---|---|---|
| 0 | SWE | Europe | 1997 | 12.8 | Sweden | 8845524.0 |
| 1 | MHL | Western Pacific | 2005 | 33.9 | Marshall Islands | 51906.0 |
| 2 | NOR | Europe | 2002 | 10.9 | Norway | 4538014.0 |
| 3 | MEX | Americas | 2001 | NaN | Mexico | 100099099.0 |
| 4 | DNK | Europe | 1994 | 26.9 | Denmark | 5206190.0 |
4. Update the data storeΒΆ
- Must maintain different instances / versions of data (raw and cleaned data)
- Must name the dataset files informatively
- Ensuring both the raw and cleaned data is saved to your database/data store
import os
os.getcwd()
'/workspace'
# Create folders
os.makedirs("/workspace/data/raw", exist_ok=True)
os.makedirs("/workspace/data/clean", exist_ok=True)
# saving data
import shutil
# Save WHO raw dataset
who_raw_df.to_csv("/workspace/data/raw/who_alcohol_raw.csv", index=False)
# Save UN raw dataset
shutil.copy(
"WPP2024_GEN_F01_DEMOGRAPHIC_INDICATORS_COMPACT.xlsx",
"/workspace/data/raw/un_population_raw.xlsx"
)
'/workspace/data/raw/un_population_raw.xlsx'
# Save clean WHO dataset
who_alcohol.to_csv("/workspace/data/clean/who_alcohol.csv", index=False)
# Save clean UN dataset
pop_clean.to_csv("/workspace/data/clean/un_population_clean.csv", index=False)
# Save merged dataset
merged.to_csv("/workspace/data/clean/alcohol_population_merged.csv", index=False)
To ensure full reproducibility and meet Udacityβs rubric requirements, I created a structured folder system inside the /workspace directory. Raw datasets obtained from the WHO API and the UN World Population Prospects file were stored under /workspace/data/raw/. After performing all cleaning steps, the cleaned datasets were saved under /workspace/data/clean/. This approach keeps the original data intact, preserves multiple versions of the dataset, and maintains a clear workflow separation between gathering, cleaning, and analysis. The final merged dataset used for EDA is also stored in the cleaned data directory. This organization follows best practices for data engineering and ensures that all files are included when exporting the workspace for submission.
*Research question: Among the countries for which WHO alcohol-related mortality data is available, which ones experience the highest mortality burden relative to population size?
#Visual 1
import matplotlib.pyplot as plt
import seaborn as sns
merged["Mortality_per_100k"] = (merged["Mortality"] / merged["TotalPopulation"]) * 100000
# Bar Chart: Top 10 Countries by Alcohol-Related Mortality Rate
top10 = merged.sort_values("Mortality_per_100k", ascending=False).head(10)
plt.figure(figsize=(12,6))
sns.barplot(data=top10, x="Country", y="Mortality_per_100k", palette="magma")
plt.xticks(rotation=45)
plt.title("Top 10 European Countries by Alcohol-Related Mortality Rate (per 100k)")
plt.ylabel("Mortality per 100,000 population")
plt.xlabel("Country")
plt.show()
Answer to research question: Among countries where mortality data is available, Marshall Islands has a dramatically higher alcohol-related mortality rate compared to Slovenia and Iceland.
#Visual 2 - Scatter Plot: Population vs Mortality Rate
plt.figure(figsize=(10,6))
sns.scatterplot(
data=merged,
x="TotalPopulation",
y="Mortality_per_100k",
hue="Country",
s=120
)
plt.xscale("log")
plt.title("Relationship Between Population Size and Alcohol-Related Mortality Rate")
plt.xlabel("Total Population (log scale)")
plt.ylabel("Mortality per 100,000 population")
plt.legend(bbox_to_anchor=(1,1))
plt.show()
*Answer to research question: This scatter plot examines how population size relates to alcohol-related mortality rates. Using a log scale highlights the wide variation in country populations.
The results show that small-population countries have disproportionately high mortality rates, with the Marshall Islands appearing as a clear outlier. In contrast, larger countries consistently show much lower and more stable mortality rates (generally under 5 deaths per 100,000).
Overall, the visualization suggests that alcohol-related mortality burden is highest in very small populations, where even small increases in deaths translate into large per-capita rates.
5.2: ReflectionΒΆ
If I had more time, I would explore additional WHO health indicators to better understand the broader context behind alcohol-related mortality, such as hospitalization rates or alcohol consumption patterns by age group. I would also investigate more sophisticated methods for handling missing data and potential inconsistencies in country naming conventions. Finally, expanding the analysis to include multiple years instead of a single snapshot could reveal important long-term trends in mortality across different regions.