Mathematics AI SL's Sample Internal Assessment

To what extent does the percentage of internet users and GDP correlate?

5/7
Candidate Name: N/A
Candidate Number: N/A
Session: N/A
Word count: 1,910

Motivation & Background

The following investigation aims to explore the relationship between changes in the percentage of internet users and GDP (gross domestic product) over 10 years in different countries. The topic investigates how the economic status and productivity of a country as a whole affect each individual's use of the internet. I am personally interested in economics and how it affects human behavior because of the impact finance has on the way we live our lives, especially when it concerns how people's consumption patterns change according to the development around them. The use of technology has been growing everywhere in the world, affecting our lives in various ways. For example, many have gained the opportunity to work from home using the internet instead of being in an office.

 

Over the years, I have become more reliant on the internet, both because of an increased school workload and because of the hairdressing business I manage on the side. My business depends on client bookings and on bookkeeping to keep track of taxes, payments, revenue, etc. I have learned to do all of it online, and as a consequence, my dependency on the internet has increased significantly. Additionally, being a student who has lived in different countries, I see how individuals use the internet differently and for varying reasons, which I find quite intriguing.

 

When I was still a young child, between 2003 and 2009, I lived in Denmark, Sweden, and France at different times. The internet was not so much a part of my life: the phones were different, there weren't apps like we have today, and computers didn't have all the features that the computers of today have. Nowadays, everything is online: we are able to pay, transfer money, send emails, make phone calls, order food and cabs, watch movies, read books, and the list goes on.

 

This has changed significantly. For example, in 2012 I moved to Brazil, which is a country still in development. Regardless of that, most of the people around me had access to the internet; most of them used it for entertainment and communication, but not so much for online work. Now, in 2021, I have seen how many have started online businesses and begun working from home, due to the development of countless apps, online chats, and FaceTime, as well as the pandemic, of course. This has allowed higher productivity and efficiency. These changes have caught my attention and have made me interested in exploring: To what extent is the percentage of internet users correlated with the GDP of a country? I want to explore whether a country's "development" has an impact on individual consumption of the internet, to understand how globalized this type of usage is.

Definitions

When considering the percentage of internet users, I have looked at internet penetration as a percentage in different countries. Internet penetration refers to the number of internet users in a country relative to its total population. This includes any type of access individuals have, through wireless or mobile connections, hotspots, dial-up, broadband, or satellite, to information, electronic mail, or content offered only on the internet. As an example, social media such as Instagram or Twitter are a form of internet communication.

 

GDP measures the total value of goods and services produced in an economy over a period of time; it reveals much about the health of a country's economy. When GDP increases, the economy is growing; this is called economic growth. This usually indicates higher average incomes and lower unemployment, among other things, but the main factor that is relevant for this investigation is technological progress, which would normally generate greater access to the internet. Conversely, when GDP decreases, the economy is shrinking; this is called negative economic growth, and it results in consequences opposite to those of economic growth.

Important consideration

In research published by the Internet Society in 2017, it is discussed how broadband internet is "unevenly distributed". The paper discusses how people in developed countries are four times more likely to have internet access than those in underdeveloped countries. This is significant information to consider, as GDP is an indicator of how developed a country is, which can be a guide for estimating the percentage of internet users in different countries. Levels of development are divided into three categories: LEDCs, MEDCs, and NICs. LEDC stands for less economically developed country, MEDC stands for more economically developed country, and NIC stands for newly industrialized country. NICs are countries whose national economy has shifted from being based on agriculture to producing goods through industries such as manufacturing and mining.

Aim and Approach

In this investigation, I will compare secondary data, specifically numerical data. I will explore the correlation between GDP (in billions of US dollars) and the percentage of internet users (as a percentage of the total population) of 100 different countries in 2020. These countries have different levels of development: they are LEDCs, MEDCs, and NICs.

 

In order to do this, I will first collect the 2020 GDP values of the 100 countries, which will be our x values, as well as the percentage of internet users in the same year for the same countries, which will be our y values. These numbers will be placed in a table, where I will calculate the square of each variable and the product of both variables. All of these calculations will be done with the help of technology. Finally, the values in each column will be added in order to find the sum of each column. These sums will be used to find the regression line. After having found the regression line, I will plot it on a scatter plot diagram to see how well the line fits the data. Depending on the results of the scatter plot, I may have to identify outliers and exclude them from the diagram, creating a second, clearer one. Afterwards, based on the values calculated in the table, I will find the value of Pearson's correlation coefficient, which will show the degree of correlation in the data and will help me reach a conclusion in relation to the topic of investigation. In order to support my findings, I will test my regression line with GDP values and percentages from the data. The aim of this exploration is to find whether countries' GDP reflects any changes to the percentage of their population that uses the internet.

Expectation & Hypothesis

Taking into account the relationship between a country's level of development and technological advancement, the expectation for this investigation is to find a moderate correlation between GDP and the percentage of internet users of a country. Why moderate? There could be various other reasons why more or fewer people use the internet in a country: the GDP could be high but the percentage of internet users low, due to lack of access, policies within the country, the standard of living, etc. However, my hypothesis is that there is a moderate degree of correlation, enough to support the claim that a high GDP indicates a high percentage of internet usage.

Data Collection

The table below (figure 1) shows a sample of the original table found in the appendix. The sums on the last row (Σ) are the values relevant to the following calculations. All the data were collected from Trading Economics and the statistics portal.

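| Countries | GDP in 2020 (x) | Percentage of internet users (y) | x² | y² | x·y |
|---|---|---|---|---|---|
| Ethiopia | 107.65 | 19 | 11588.5225 | 361 | 2045.35 |
| Venezuela | 47.26 | 95.1 | 2233.5076 | 9044.01 | 4494.426 |
| Haiti | 13.42 | 33 | 180.0964 | 1089 | 442.86 |
| Afghanistan | 19.81 | 20 | 392.4361 | 400 | 396.2 |
| Cambodia | 25.29 | 78.8 | 639.5841 | 6209.44 | 1992.852 |
| Gambia | 1.9 | 20 | 3.61 | 400 | 38 |
| Liberia | 2.95 | 12 | 8.7025 | 144 | 35.4 |
| Guinea | 15.86 | 20 | 251.5396 | 400 | 317.2 |
| Angola | 62.31 | 28 | 3882.5361 | 784 | 1744.68 |
| Bangladesh | 324.24 | 41 | 105131.5776 | 1681 | 13293.84 |
| Σ | 72989.34 | 6230.09 | 731384901.4 | 477886.6061 | 5948468.996 |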
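The squared and product columns of the table can be reproduced with a short script. The following is a minimal Python sketch that uses only the ten sample rows shown above; the variable names are illustrative assumptions, and the sums it prints are for these ten rows only, not the Σ row, which covers all 100 countries in the appendix.

```python
# Ten sample rows from the table:
# (country, GDP in 2020 in billion US dollars, percentage of internet users).
sample = [
    ("Ethiopia", 107.65, 19),
    ("Venezuela", 47.26, 95.1),
    ("Haiti", 13.42, 33),
    ("Afghanistan", 19.81, 20),
    ("Cambodia", 25.29, 78.8),
    ("Gambia", 1.9, 20),
    ("Liberia", 2.95, 12),
    ("Guinea", 15.86, 20),
    ("Angola", 62.31, 28),
    ("Bangladesh", 324.24, 41),
]

# Build the extra columns x^2, y^2 and x*y for each row.
rows = [(name, x, y, x * x, y * y, x * y) for name, x, y in sample]

# Column sums for these ten rows only (the table's sum row covers 100 countries).
sum_x = sum(r[1] for r in rows)
sum_y = sum(r[2] for r in rows)
sum_x2 = sum(r[3] for r in rows)
sum_y2 = sum(r[4] for r in rows)
sum_xy = sum(r[5] for r in rows)

for name, x, y, x2, y2, xy in rows:
    print(f"{name:12s} {x:10.2f} {y:6.1f} {x2:14.4f} {y2:10.2f} {xy:12.3f}")
print(f"{'Sample sum':12s} {sum_x:10.2f} {sum_y:6.1f} {sum_x2:14.4f} {sum_y2:10.2f} {sum_xy:12.3f}")
```

Running the same loop over the full appendix table would be expected to give the Σ row used in the calculations that follow.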

As shown in figure 1, the 2020 GDP (in billions of US dollars) of 100 countries has been collected, as well as the percentage of internet users (as a percentage of the total population) for the same year. GDP is our x value and the percentage of internet users is our y value. After collecting all this data, I graphed the GDP of the countries against the percentage of internet users in 2020 on a scatter plot diagram, shown in figure 2.

Figure 2. GDP vs Percentage of internet users

A scatter plot diagram shows the extent of the correlation between the observed values. The dots represent data points, the x-axis represents GDP, and the y-axis represents the percentage of internet users. Based on what is portrayed, there is a high concentration of data points on the left side of the diagram and very few on the right side. Recall that the expectation is that the higher the GDP value, the higher the percentage of internet users, which is not seen in the diagram: there are low GDP values that correspond to a high percentage of internet users. This might be due to globalized internet usage, meaning that the percentage of internet users in a country is independent of other factors.

Calculating Regression Line

To explore a potential relationship between the x and y variables, I will use several calculations, starting with finding the linear regression line. This shows the relationship between two variables by creating the line that best fits the data. It reveals the general trend of the figures.

 

Equation: \(y = mx + b\)

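| Column | 1 | 2 | 3 | 4 | 6 |
|---|---|---|---|---|---|
| Variable | GDP in 2020 (x) | Percentage of internet users (y) | x² | y² | x·y |
| Σ | 72989.34 | 6230.09 | 731384901.4 | 477886.6061 | 5948468.996 |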

STEP 1: Finding slope

a) I will begin finding the slope (m) using the following formula:

 

\( m=\frac{n \Sigma xy-\Sigma x \Sigma y}{n \Sigma x^{2}-(\Sigma x)^{2}} \)

 

Σxy = The sum of x·y values, this number is found on the last row of column 6.
Σx·Σy = The product of the sum of the x values and the sum of the y values. The sum of x values is found on the last row of column 1, and the sum of y values on the last row of column 2.
Σx² = The sum of all x values squared found on the last row of column 3.
(Σx)² = The sum of x values (found on the last row of column 1) squared.
n = The total number of data points, which in this case would be 100.

 

b) Substituting values into formula

 

\( m=\frac{n(5948468.996)-(72989.34)(6230.09)}{n(731384901.4)-(72989.34)^{2}} \)

 

\( m=\frac{100(5948468.996)-(72989.34)(6230.09)}{100(731384901.4)-(72989.34)^{2}} \)

 

\( m=\frac{594846899.6-454730157.241}{73138490140-5327443753.64} \)

 

\( m=\frac{140116742.359}{67811046386.4} \)

 

\( m=0.002066281968 \)

 

\( m \approx 0.00207 \) (3 significant figures)
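The slope calculation can be checked with technology. The following is a minimal Python sketch, assuming the column sums from the last row of the table and n = 100; the variable names are illustrative and not part of the original investigation.

```python
# Column sums taken from the last row of the table (100 countries).
n = 100
sum_x = 72989.34        # sum of GDP values (column 1)
sum_y = 6230.09         # sum of percentages of internet users (column 2)
sum_x2 = 731384901.4    # sum of squared x values (column 3)
sum_xy = 5948468.996    # sum of the x*y products (column 6)

# Slope of the least-squares line:
# m = (n*Sxy - Sx*Sy) / (n*Sxx - (Sx)^2)
numerator = n * sum_xy - sum_x * sum_y
denominator = n * sum_x2 - sum_x ** 2
m = numerator / denominator

print(f"numerator   = {numerator:.3f}")
print(f"denominator = {denominator:.1f}")
print(f"m           = {m:.9f}")
```

Note that the denominator uses the square of the sum of the x values, (Σx)², not the sum of the squared y values.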

STEP 2: Finding y-intercept

a) Moving forward, I will be finding the y-intercept (b) using the following formula:

 

\( b=\frac{\Sigma y-m \Sigma x}{n} \)

 

Σy = This is the sum of all the y values, found on the last row of column 2.
Σx = This is the sum of all the x values found on the last row of column 1.
m = The slope found in Step 1. It will be multiplied by the sum of x.
n = Total number of data points, which in this case is 100.

 

b) Plugging values into the formula

 

\( b=\frac{6230.09-0.002066281968(72989.34)}{100} \)

 

\( b=\frac{6230.09-150.816557098}{100} \)

 

\( b=\frac{6079.2734429}{100} \)

 

\( b=60.792734429 \)

 

\( b \approx 60.8 \) (3 significant figures)

STEP 3: Substituting into formula

a) In order to create the line of regression, I will substitute the values of the slope (m) and the value of the intercept (b) into the regression line formula.

 

NOTE: the numbers will be rounded to 3 significant figures

 

\( y=0.00207x+60.8 \)
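As a check on Steps 2 and 3, the following minimal Python sketch recomputes the intercept from the same column sums and wraps the regression line in a function. The function name `predict` is a hypothetical helper introduced here for illustration. It also verifies a known property of least-squares lines: the line passes through the point of means (x̄, ȳ).

```python
# Column sums from the last row of the table (100 countries).
n = 100
sum_x, sum_y = 72989.34, 6230.09
sum_x2, sum_xy = 731384901.4, 5948468.996

# Slope (Step 1) and y-intercept (Step 2), kept unrounded here.
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n


def predict(gdp):
    """Predicted percentage of internet users for a GDP in billion US dollars."""
    return m * gdp + b


# The least-squares line passes through the point of means (x-bar, y-bar).
mean_x, mean_y = sum_x / n, sum_y / n

print(f"b = {b:.6f}")
print(f"y = {m:.5f}x + {b:.1f}")
print(f"predict(mean_x) = {predict(mean_x):.4f}, mean_y = {mean_y:.4f}")
```

Rounding m and b to 3 significant figures only at the final step, as done above, avoids carrying rounding error into the intercept.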

Figure 4. GDP vs Percentage of internet users with the regression line

The trend line (regression line) plotted above essentially does not fit the data, as it does not pass through any of the data points. This is a possible indicator of no correlation. However, before jumping to conclusions, I will proceed to find one more value that will confirm what the regression line has shown.

Calculating "r"

The r value is Pearson's correlation coefficient, a number that tells us to what extent the data values fit the regression line and to what extent the x and y values are correlated. The closer the value of r is to 1 (indicating a positive slope) or to -1 (indicating a negative slope), the better the linear regression line fits the data. Regarding the strength of the correlation, an r value between 0 and 0.25 shows a very weak correlation; an r value between 0.25 and 0.5 indicates a weak correlation; an r value between 0.5 and 0.75 reflects a moderate correlation; and an r value between 0.75 and 1 shows a strong correlation.
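The bands described above can be written as a small lookup. This is a minimal Python sketch under two assumptions that the text does not state: the bands are applied to the absolute value of r, so that negative correlations are graded the same way, and a value exactly on a boundary (for example 0.25) is placed in the higher band. The function name `describe_correlation` is hypothetical.

```python
def describe_correlation(r):
    """Return (strength, direction) for a Pearson correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    # Strength bands from the text, applied to |r| (assumption).
    strength = abs(r)
    if strength < 0.25:
        label = "very weak"
    elif strength < 0.5:
        label = "weak"
    elif strength < 0.75:
        label = "moderate"
    else:
        label = "strong"

    # The sign of r gives the direction of the slope.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return label, direction


print(describe_correlation(0.6))
print(describe_correlation(-0.8))
```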

Steps to finding Pearson's Correlation Coefficient

a) Finding "r" requires the following equation:

 

\( r=\frac{n(\Sigma xy)-(\Sigma x)(\Sigma y)}{\sqrt{[n \Sigma x^{2}-(\Sigma x)^{2}][n \Sigma y^{2}-(\Sigma y)^{2}]}} \)

 

n = the number of data points, which is 100 in this case.
(Σx·y) = The sum of x multiplied by y values, found on the last row of column 6.
(Σx)·(Σy) = Sum of x values, found on the last row of column 1 multiplied by the sum of y values, found on the last row of column 2.
Σx² = Sum of all squared x values, found on the last row column 3.
(Σx)² = Sum of x values (found on the last row of column 1) squared.
Σy² = Sum of all squared y values, found on the last row of column 4.
(Σy)² = Sum of y values (found on the last row of column 2) squared.

 

b) Substituting values into the formula

 

\( r=\frac{n(5948468.996)-(72989.34)(6230.09)}{\sqrt{[n(731384901.4)-(72989.34)^{2}][n(477886.6061)-(6230.09)^{2}]}} \)

\( r=\frac{100(5948468.996)-(72989.34)(6230.09)}{\sqrt{[100(731384901.4)-(72989.34)^{2}][100(477886.6061)-(6230.09)^{2}]}} \)

 

\( r=\frac{594846899.6-454730157.241}{\sqrt{[73138490140-5327443753.64][47788660.61-38814021.4081]}} \)

 

\( r=\frac{140116742.359}{\sqrt{[67811046386.4][8974639.2019]}} \)

 

\( r=\frac{140116742.359}{\sqrt{6.08579675221 \times 10^{17}}} \)

 

\( r=0.179610329501 \)

 

\( r \approx 0.18 \) (2 significant figures)
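The value of r can be verified in the same way as the slope. The following is a minimal Python sketch, assuming the column sums from the last row of the table and n = 100; the variable names are illustrative.

```python
import math

# Column sums taken from the last row of the table (100 countries).
n = 100
sum_x = 72989.34
sum_y = 6230.09
sum_x2 = 731384901.4
sum_y2 = 477886.6061
sum_xy = 5948468.996

# Pearson's correlation coefficient:
# r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - (Sx)^2) * (n*Syy - (Sy)^2))
numerator = n * sum_xy - sum_x * sum_y
spread_x = n * sum_x2 - sum_x ** 2
spread_y = n * sum_y2 - sum_y ** 2
r = numerator / math.sqrt(spread_x * spread_y)

print(f"spread_x = {spread_x:.1f}")
print(f"spread_y = {spread_y:.4f}")
print(f"r        = {r:.6f}")
```

The numerator is the same as in the slope formula, which is why r and m always share the same sign.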

Comment on Findings

Considering the boundaries for different values of r and the corresponding results, it is fair to say that the correlation between the two variables is positive but very weak. The regression line helps to explain the r value visually: the line slopes upward, hence the r value is positive, but the value is very small, close to zero, making it a very weak correlation, close to no correlation at all. This comes as a surprising result, since the hypothesis was to find a moderate correlation.

Testing the regression line

To confirm the results once more, I will test the regression line with the 2019 GDP values of the United States, Bangladesh, and Italy, to find the percentage of internet users it predicts, and compare this with the actual values through a percentage error calculation.

| Country (actual percentage for 2019) | Testing regression line | Percentage error |
|---|---|---|
| United States (95%) | y = 0.0021(21433.22) + 60.79 = 105.8% | \( \frac{\lvert 105.8-95 \rvert}{95} \cdot 100 = 11.4\% \) |
| Bangladesh (55%) | y = 0.0021(302.56) + 60.79 = 61.4% | \( \frac{\lvert 61.4-55 \rvert}{55} \cdot 100 = 11.6\% \) |
| Italy (92%) | y = 0.0021(2004.91) + 60.79 = 65% | \( \frac{\lvert 65-92 \rvert}{92} \cdot 100 = 29.3\% \) |
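The three tests above can be reproduced with a short script. This is a minimal Python sketch using the rounded coefficients from the table (0.0021 and 60.79) and the 2019 GDP values and actual percentages quoted there; the function names `predict` and `percentage_error` are hypothetical helpers. The absolute value is taken so that an underestimate, as for Italy, still gives a positive percentage error.

```python
def predict(gdp, m=0.0021, b=60.79):
    """Predicted percentage of internet users from the regression line."""
    return m * gdp + b


def percentage_error(predicted, actual):
    """Absolute percentage error of a prediction relative to the actual value."""
    return abs(predicted - actual) / actual * 100


# (country, GDP in 2019 in billion US dollars, actual percentage for 2019)
cases = [
    ("United States", 21433.22, 95),
    ("Bangladesh", 302.56, 55),
    ("Italy", 2004.91, 92),
]

results = []
for country, gdp, actual in cases:
    predicted = round(predict(gdp), 1)  # rounded to 1 decimal place, as in the table
    error = round(percentage_error(predicted, actual), 1)
    results.append((country, predicted, error))
    print(f"{country:14s} predicted {predicted:6.1f}%  actual {actual}%  error {error:.1f}%")
```

The prediction for the United States exceeds 100%, which is not a possible percentage of a population; this is one sign that the linear model does not describe the data well.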

As shown in the table above, the percentage error is relatively high and inconsistent. At first, I calculated this only for the US and Bangladesh, and I got similar results, because the percentage calculated for both countries was similarly close to the real percentage value for 2019. I then decided to do this calculation for one more country, and the result was a completely different percentage error. This further confirms that there is a very weak correlation; I would safely conclude that there is, rather, no correlation at all between the two variables.

Reflection & Evaluation

The aim of this investigation was to find whether countries' GDP reflects any changes to the percentage of their population that uses the internet. The results have shown that there is essentially no correlation between those variables. Throughout this investigation, there were some limitations that contributed to a result that did not match what the hypothesis predicted. A major limitation was the choice of variables, specifically GDP. As was lightly touched upon in the introduction, GDP does not account for an individual's economic well-being and enjoyment of the country's development, meaning that a high GDP is not necessarily an indicator of how much access an individual in a country is going to have to the internet.

 

This can be observed in the numbers in the data table. There are inconsistent values, such as those for Brazil. It is considered to be an LEDC; however, it had a very high GDP in 2020 in comparison to other underdeveloped countries, along with a higher percentage of internet users. Another example is Samoa, with a very low GDP (0.81) and a relatively high percentage of internet users (66 percent) in proportion to its GDP and considering its population size. Among the "more developed countries", there were also some outliers such as China, with a GDP of 14722.73 and a percentage of internet users of 59%.

 

This was the main limitation of the investigation, because there could have been a correlation on an individual level that was not shown because of the use of GDP. A better alternative to GDP could have been GDP per capita, as it reflects the individual economic production of the residents of a country. For example, if a country's GDP per capita increases while its population remains stable, it shows potential technological growth being produced by a population of the same size.

 

As for the mathematical approaches, finding the r value was appropriate. The aim was not to find whether one variable caused the other; rather, it was to investigate whether changes in one variable were associated with changes in the other. Initially, there were miscalculations made when finding the r value, resulting in a much higher value, which did not match the scatter plot diagrams, which showed no correlation. This required a second round of calculations, which produced the correct r value.

 

On the other hand, the regression line was a poor fit for the data points, because the spread of the data on the scatter plot was not linear. Other types of regression, such as logarithmic or exponential, might have represented the spread of the data more accurately; however, as said before, we found no correlation in the data, and I suspect that even with other regressions the result would be the same, or at least similar.
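A logarithmic regression of the form y = a + b·ln(x) can be fitted with the same least-squares formulas by replacing each x with ln(x). The following is a minimal Python sketch of that idea, not a result of this investigation: it uses only the ten sample rows from figure 1, so its output says nothing reliable about the full set of 100 countries, and the helper names `least_squares` and `predict_log` are hypothetical.

```python
import math

# (GDP in 2020 in billion US dollars, percentage of internet users)
# for the ten sample rows of figure 1.
sample = [
    (107.65, 19), (47.26, 95.1), (13.42, 33), (19.81, 20), (25.29, 78.8),
    (1.9, 20), (2.95, 12), (15.86, 20), (62.31, 28), (324.24, 41),
]


def least_squares(us, vs):
    """Slope and intercept of the least-squares line v = slope*u + intercept."""
    n = len(us)
    su, sv = sum(us), sum(vs)
    suu = sum(u * u for u in us)
    suv = sum(u * v for u, v in zip(us, vs))
    slope = (n * suv - su * sv) / (n * suu - su ** 2)
    intercept = (sv - slope * su) / n
    return slope, intercept


xs = [x for x, _ in sample]
ys = [y for _, y in sample]

# Logarithmic model: regress y on ln(x) instead of on x.
log_xs = [math.log(x) for x in xs]
b_log, a_log = least_squares(log_xs, ys)


def predict_log(gdp):
    """Prediction of the logarithmic model y = a + b*ln(x)."""
    return a_log + b_log * math.log(gdp)


# Residuals of a least-squares fit with an intercept sum to (almost) zero.
residuals = [y - predict_log(x) for x, y in sample]
print(f"y = {a_log:.2f} + {b_log:.2f} ln(x)")
```

Applying the same function to the full data set and comparing the resulting r value with 0.18 would be one way to test whether a logarithmic model fits any better.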
