This analysis is on suicide rates from 1985 to 2016 in a set of about 100 countries with additional information on age, gender and economic status of that country. The data is available from Kaggle.
This post here is a slightly shortened version of the full analysis that can be found in the repository on Gitlab.
library("dplyr")
library("ggplot2")
suicide <- readRDS("suicide.Rds")
As always some data cleaning and supplementing was undertaken before beginning to analyse data. Data cleaning was however minor and included:
The final data looks like this:
suicide
## # A tibble: 25,752 x 12
## country year sex age suicides_no population suicides_100k_p…
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Albania 1987 male 15-2… 21 312900 6.71
## 2 Albania 1987 male 35-5… 16 308000 5.19
## 3 Albania 1987 fema… 15-2… 14 289700 4.83
## 4 Albania 1987 male 75+ … 1 21800 4.59
## 5 Albania 1987 male 25-3… 9 274300 3.28
## 6 Albania 1987 fema… 75+ … 1 35600 2.81
## 7 Albania 1987 fema… 35-5… 6 278800 2.15
## 8 Albania 1987 fema… 25-3… 4 257200 1.56
## 9 Albania 1987 male 55-7… 1 137500 0.73
## 10 Albania 1987 fema… 5-14… 0 311000 0
## # … with 25,742 more rows, and 5 more variables: country_year <chr>,
## # gdp_for_year <dbl>, gdp_per_capita <dbl>, generation <chr>,
## # continent <chr>
First, let’s see for which countries data is available.
data_countries <- suicide %>%
select(country) %>%
distinct() %>%
mutate(country = countrycode::countrycode(
country, origin = "country.name", destination = "iso.name.en")) %>%
tibble::deframe()
world_map <- rnaturalearth::ne_countries(returnclass = "sf") %>%
select(admin, geometry) %>%
mutate(admin_iso = countrycode::countrycode(
admin, origin = "country.name", destination = "iso.name.en")) %>%
# kosovo turns NA (likely as it is not universally recognised as independent)
mutate(admin_iso = tidyr::replace_na(admin_iso, "Kosovo")) %>%
mutate(in_data = ifelse(admin_iso %in% data_countries,
"available", "not available"))
ggplot(world_map) +
aes(fill = in_data) +
geom_sf(colour = "black", size = .1) +
scale_fill_manual(values = c("available" = "maroon4",
"not available" = "white")) +
coord_sf(datum = NA) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(title = "Countries with available data", fill = "")
The map clearly shows that for many countries of Africa and Asia no data is available. Consequently, the analysis below is not able to identify truly global patterns and caution is advised in generalising the findings too much.
It makes sense to start with the simplest way of looking at the data first: globally over time.
by_year <- suicide %>%
group_by(year) %>%
summarise(mean = mean(suicides_100k_pop),
se = sd(suicides_100k_pop) / sqrt(n()))
plot_breaks <- seq(min(by_year$year), max(by_year$year), 3)
ggplot(by_year) +
aes(x = year, y = mean, ymin = mean - se, ymax = mean + se) +
geom_ribbon(alpha = .1) +
geom_line() +
geom_point() +
expand_limits(y = 0) +
labs(x = "Year", y = "Suicides per 100,000 people",
title = "Global average suicide rate",
subtitle = "The shaded area shows the standard error") +
scale_x_continuous(breaks = plot_breaks) +
theme_minimal()
From this plot it seems like there is no overall trend in global suicide rates in the time frame available in the data. But we should take a closer look at the data.
Let’s identify the countries which have the highest suicide rates in any of the years available in the data.
highest_rate_year <- suicide %>%
group_by(year, country) %>%
summarise(mean_100k = mean(suicides_100k_pop)) %>%
slice(which.max(mean_100k))
ggplot(highest_rate_year) +
aes(x = year, y = mean_100k, fill = country, label = country) +
geom_bar(stat = "identity") +
geom_text(size = 2.5, angle = 90, aes(y = 10), hjust = 0) +
guides(fill = FALSE) +
scale_x_continuous(breaks = c(
min(highest_rate_year$year, na.rm = TRUE):
max(highest_rate_year$year, na.rm = TRUE))) +
labs(title = "Countries with highest suicide rate",
y = "Suicides per 100,000 people",
x = "Year") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
It appears like Lithuania and South Korea are the two countries that tend to have the highest suicides rates for the countries available in the data.
And it is probably interesting to have a look at how the countries in this subset of the data develop over time in the available time frame:
highest_rates <- suicide %>%
filter(country %in% highest_rate_year$country) %>%
group_by(country, year) %>%
summarise(mean = mean(suicides_100k_pop))
ggplot(highest_rates) +
aes(x = year, y = mean, colour = country) +
geom_line() +
geom_smooth(alpha = .2) +
geom_point() +
expand_limits(y = 0) +
scale_x_continuous(breaks = plot_breaks) +
labs(x = "Year", y = "Suicides per 100,000 people",
colour = "Country",
title = "Suicides in countries with highest rates") +
theme_minimal() +
theme(legend.position = "bottom")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
In this plot we can see that suicide rates in South Korea and Suriname increase a lot after the year 2000, although there is a slight trend in the opposite direction again for South Korea after 2010. Data for Hungary and Lithuania is not available for the entire time frame and rates in both countries start very high.
Given everything we know about suicides one would expect that the rates are much higher in men than in women.
by_year_and_gender <- suicide %>%
group_by(year, sex) %>%
summarise(mean = mean(suicides_100k_pop),
se = sd(suicides_100k_pop) / sqrt(n()))
ggplot(by_year_and_gender) +
aes(x = year, y = mean, group = sex, colour = sex, fill = sex,
ymin = mean - se, ymax = mean + se) +
geom_ribbon(alpha = .2, colour = NA) +
geom_line() +
geom_point() +
expand_limits(y = 0) +
scale_x_continuous(breaks = plot_breaks) +
labs(x = "Year", y = "Suicides per 100,000 people",
colour = "Gender", fill = "Gender", title = "Suicides by gender",
subtitle = "The shaded area shows the standard error") +
theme_minimal() +
theme(legend.position = "bottom")
From this plot we can see that suicide rates are generally much higher in men than in women. For women the rate remains rather constant at around 5 per 100,000 people, while for men it is at around 20 and fluctuates much more.
We can further illustrate the disparity by gender.
gender_disparity <- suicide %>%
group_by(country, sex) %>%
summarise(mean = mean(suicides_100k_pop)) %>%
tidyr::spread(sex, mean) %>%
rowwise() %>%
mutate(overall = mean(c(male, female))) %>%
ungroup()
gender_disparity %>%
arrange(overall) %>%
mutate(country = factor(country, levels = country)) %>%
ggplot() +
aes(y = country, xmin = female, xmax = male) +
geom_errorbarh(height = 0, colour = "grey80") +
geom_point(aes(x = female), colour = "red4") +
geom_point(aes(x = male), colour = "green4") +
geom_point(aes(x = overall), colour = "midnightblue", size = 0.5) +
labs(title = "Suicide gender disparity",
subtitle = "Rates per 100,000 people by gender and mean",
y = "Country",
x = "Gender Disparity") +
theme_minimal()
The plot shows the averages for the entire available time frame, so more long term trends are visualised. There seems to be a trend that the higher the overall suicide rate, the larger is the disparity in absolute numbers. Red data points indicate rates for women, green points rates for men, with the smaller blue dot showing the mean.
As the data is available, let’s also have a look at data about age groups:
# fix age sorting
by_age_group <- suicide %>%
mutate(age = stringr::str_replace(.$age, "5-14", "05-14")) %>%
mutate(age = as.factor(age)) %>%
group_by(year, age) %>%
summarise(mean = mean(suicides_100k_pop),
se = sd(suicides_100k_pop) / sqrt(n()))
ggplot(by_age_group) +
aes(x = year, y = mean, ymin = mean - se, ymax = mean + se,
colour = age, fill = age, group = age) +
geom_ribbon(alpha = .2, colour = NA) +
geom_line() +
geom_point() +
scale_x_continuous(breaks = plot_breaks) +
labs(x = "Year", y = "Suicides per 100,000 people",
title = "Suicide rates in different age groups",
subtitle = "Shaded areas indicate the standard error of the mean",
colour = "Age", fill = "Age") +
theme_minimal() +
theme(legend.position = "bottom")
This plot strongly suggests that the older a person is, the likelier they commit suicide. The 75+ years age group is clearly at the top, the 5-14 years group pretty much at the bottom and the rest somewhat close together in the middle.
There is information about GDP and GDP per Capita in the data. This can be used as indicator for overall wealth, but that approach is not without issue, as it does not really provide insight about an individual’s situation, but only provides a rough population average.
by_gdp_per_capita <- suicide %>%
group_by(country, year) %>%
summarise(mean = mean(suicides_100k_pop),
se = sd(suicides_100k_pop) / sqrt(n()),
continent = first(continent),
gdp_per_capita = mean(gdp_per_capita)) # workaround,
ggplot(by_gdp_per_capita) +
aes(x = gdp_per_capita, y = mean) +
geom_point(aes(colour = continent), alpha = .5) +
geom_smooth(method = "lm") +
labs(x = "GDP per capita (USD)", y = "Suicides per 100,000 people",
colour = "Continent",
title = "Suicides by GDP per capita") +
scale_x_continuous(breaks = seq(
0, max(by_gdp_per_capita$gdp_per_capita), 10000)) +
theme_minimal() +
theme(legend.position = "bottom")
There is a slight linear trend in the data, but just from the appearance I would say that GPD per capita is not exactly a driver of suicide. A quick linear model (which has issues, as the data is non-normal and heteroscedastic) confirms that there is not much of a trend:
gdp_mod <- lm(suicides_100k_pop ~ gdp_per_capita, data = suicide)
summary(gdp_mod)
##
## Call:
## lm(formula = suicides_100k_pop ~ gdp_per_capita, data = suicide)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.353 -12.112 -6.754 4.095 191.673
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.313e+01 1.606e-01 81.722 <2e-16 ***
## gdp_per_capita 9.716e-06 6.389e-06 1.521 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.08 on 25750 degrees of freedom
## Multiple R-squared: 8.98e-05, Adjusted R-squared: 5.097e-05
## F-statistic: 2.313 on 1 and 25750 DF, p-value: 0.1283
To put it in more human readable terms, we can extract the estimate from the model and multiply it by 10,000:
gdp_mod$coefficients[2] * 10000
## gdp_per_capita
## 0.09715646
A change of 0.097 in suicides per 100,000 people per 10,000 USD gdp per capita is not exactly a large effect.
Finally, let’s have a look at how suicides develop for each country and whether there are trends. For each country a linear model is created so we can then identify those with the largest weights in both increasing and decreasing direction.
country_models <- suicide %>%
# make sure data for countries are in order, so the association of models
# to countries later is correct
arrange(country) %>%
group_by(country) %>%
# apply lm to each group, i.e. country
group_map(~lm(suicides_100k_pop ~ year, data = .)) %>%
# get a nice overview ...
purrr::map(broom::tidy) %>%
# ... and paste it togehter
do.call(rbind, .) %>%
# finally add the country to the data
tibble::add_column(country = rep(unique(suicide$country), each = 2))
most_increasing <- country_models %>%
filter(term == "year") %>%
arrange(desc(estimate)) %>%
select(country) %>%
slice(1:12) %>%
tibble::deframe()
most_decreasing <- country_models %>%
filter(term == "year") %>%
arrange(estimate) %>%
select(country) %>%
slice(1:12) %>%
tibble::deframe()
First plot the development in countries with increasing rates over the available time frame.
suicide %>%
filter(country %in% most_increasing) %>%
group_by(country, year) %>%
summarise(mean = mean(suicides_100k_pop)) %>%
ggplot() +
aes(x = year, y = mean, colour = country, fill = country) +
geom_point(alpha = .4) +
geom_smooth(method = "lm", alpha = .2) +
guides(colour = FALSE, fill = FALSE) +
labs(y = "Suicides per 100,000 people", x = "Year",
title = "Most increasing suicide rates 1985-2015") +
facet_wrap(~country, ncol = 4, nrow = 3) +
theme_minimal()
This plot illustrates that from those countries with the steepest slopes indicating an increase in suicide rates, there are a few that stand out, namely Guyana, the Republic of Korea (South Korea), Suriname and Uruguay.
Of course, increasing suicide rates for the other eight countries for which data is shown here is a reason for concern, but the trend is much smaller.
suicide %>%
filter(country %in% most_decreasing) %>%
group_by(country, year) %>%
summarise(mean = mean(suicides_100k_pop)) %>%
ggplot() +
aes(x = year, y = mean, colour = country, fill = country) +
geom_point(alpha = .4) +
geom_smooth(method = "lm", alpha = .2) +
guides(colour = FALSE, fill = FALSE) +
labs(y = "Suicides per 100,000 people", x = "Year",
title = "Most decreasing suicide rates 1985-2015") +
facet_wrap(~country, ncol = 4, nrow = 3) +
theme_minimal()
Contrary to the plot with increasing trends, we do see primarily rather steep trends here. These are the twelve steepest downward slopes, so there may be a few more countries with similarly decreasing suicide rates. However, we can see that there are quite a few countries included in this plot for which data is not available for the entire data range, so some of the results should be interpreted with caution.
The available data is somewhat limited and suicide is a rather complex issue. In-depth conclusions on suicide can not be drawn from the data that is available, but some patterns could be identified: