As an inquirer and a creative thinker, I have always aspired to contribute to society with the skills and knowledge I acquire. I believe that real-life experience is what genuinely motivates an investigation. I recently encountered cancer, one of the most harmful diseases, when one of my neighbours was diagnosed with it. As he worked at a nuclear power plant, his doctors suggested that radiation leakage was one possible cause of his cancer. The doctors' statement raised several questions in my mind. Does working in a nuclear power station cause cancer? Does the age of nuclear power plant employees increase their chance of developing cancer? To answer these questions, I did some research. I read a few research journals on cancer and medical science, which helped me to understand different cancer-causing agents.
Having understood several causes of cancer, I have tried to explore the probability of developing cancer based on one significant parameter of a nuclear power plant, i.e., the number of working employees. To derive a correlation between the chance of developing cancer in a nuclear power station and the total number of employees, I also researched different correlation coefficients to justify the derived correlation. In the process, I learnt to use Pearson’s Correlation Coefficient, which is an extension of the regression correlation coefficient that I studied in the IB curriculum.
After all of this research, I arrived at the research question of this exploration, which asks whether a person working in a nuclear power plant with a larger number of employees has a greater chance of developing cancer than a person working in a plant with fewer employees.
The prime objective of this exploration is to derive a relationship between the chance of developing cancer for a worker of an Atomic Power Station and the total number of professionals working in the power station.
To what extent is there a correlation, for three different age groups of individuals (Gr 1: 30 years to 45 years, Gr 2: 45 years to 60 years, and Gr 3: 60 years to 75 years), between the number of workers diagnosed with cancer during their period of service or after retirement in different Atomic Power Plants in the United States of America and the total number of workers working in each Atomic Power Plant?
An atomic power plant uses the process of nuclear fission to generate energy. Fission takes place in nuclear reactors, where the heat generated is used to produce electricity. During the process, several types of radiation, such as α-rays, β-rays and γ-rays, are emitted. Amongst these, the most penetrating is the γ-ray. Although many precautions are taken in atomic power plants to prevent radiation leakage, cases of leakage have been observed, and these affect human life and the environment.
The regression correlation coefficient provides information about the strength of an obtained correlation between a dependent variable and its corresponding independent variable. Its value lies between 0 and 1: a value of 1 denotes the maximum strength of correlation, whereas 0 denotes no correlation. The mathematical formulation of the regression correlation coefficient for a linear trend is shown below:
\(r^2=\bigg[\frac{n\big(\sum xy\big)-(\sum x)(\sum y)}{\sqrt{\big[n\sum x^2-\big(\sum x\big)^2\big]\big[n\sum y^2-\big(\sum y\big)^2\big]}}\bigg]^2\)
x = independent variable
y = dependent variable
\(r^2\) = regression correlation coefficient
n = number of observations
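As a minimal sketch of the formula above, the coefficient can be computed directly from the raw sums. The function name `r_squared` and the sample points are illustrative assumptions and are not part of the data used in this exploration.

```python
from math import sqrt

def r_squared(xs, ys):
    """Squared linear correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return (numerator / denominator) ** 2

# Hypothetical points lying exactly on the line y = 2x + 1,
# so the coefficient should take its maximum value.
print(r_squared([1, 2, 3, 4], [3, 5, 7, 9]))
```

Points that fall exactly on a straight line give the maximum value of the coefficient, while scattered points give a value closer to 0.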
Pearson’s correlation coefficient provides information about the strength and the nature of an obtained correlation between a dependent variable and its corresponding independent variable. Its value lies between -1 and 1: a value of ±1 denotes the maximum strength of correlation, whereas 0 denotes no correlation. A positive value of Pearson’s coefficient signifies that the relationship is increasing in nature, and a negative value indicates that it is decreasing in nature. The mathematical formulation of Pearson’s correlation coefficient for a linear trend is shown below:
\(R=\frac{\sum(x-\bar x)(y-\bar y)}{\sqrt{\sum(x-\bar x)^2\times\sum(y-\bar y)^2}}\)
x = independent variable
y = dependent variable
R = Pearson's correlation coefficient
\(\bar x\) = mean value of all observations of the independent variable
\(\bar y\) = mean value of all observations of the dependent variable
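The deviation sums in this formula can be rewritten in terms of raw sums, which shows how R relates to the regression correlation coefficient defined earlier. The identities below are standard algebra, given here as a worked step, with n denoting the number of observations:

\(\sum(x-\bar x)(y-\bar y)=\sum xy-\frac{(\sum x)(\sum y)}{n}\)
\(\sum(x-\bar x)^2=\sum x^2-\frac{(\sum x)^2}{n}\)
\(\sum(y-\bar y)^2=\sum y^2-\frac{(\sum y)^2}{n}\)

Substituting these into the formula for R and multiplying the numerator and denominator by n gives the expression inside the square brackets of the \(r^2\) formula. For a linear trend, therefore, \(r^2=R^2\), and R additionally carries the sign of the relationship.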
In this exploration, ten central atomic power stations in the United States of America are chosen. For each station, the total number of employees who currently work or have previously worked there has been collected for the three age groups mentioned in the research question, together with the number of those employees diagnosed with cancer during their tenure of service or after retirement. To check the consistency of the collected data, the percentage of diagnosed employees has been calculated for each power station. Finally, for each age group, the number of diagnosed employees in each power station has been plotted against the total number of employees working or having worked in that power station. To verify the correlation, the regression correlation coefficient and Pearson's correlation coefficient have been calculated, and the correlation is evaluated using a T-Test.
Null Hypothesis: It is assumed that there is no correlation between the number of employees diagnosed with cancer during their period of service or after retirement in different Nuclear Power Plants in the United States of America and the total number of employees working in the Nuclear Power Plant.
Alternate Hypothesis: It is assumed that there is a correlation between the number of employees diagnosed with cancer during their period of service or after retirement in different Nuclear Power Plants in the United States of America and the total number of employees working in the Nuclear Power Plant.
Data table -
Name | Total | Infected | Percentage |
---|---|---|---|
Rochester City Project | 328 | 33 | 10.06 |
Chicago City Project | 348 | 36 | 10.34 |
San Diego City Project | 386 | 42 | 10.88 |
Newark City Project | 452 | 72 | 15.93 |
Texas City Project | 458 | 53 | 11.57 |
Dayton City Project | 673 | 88 | 13.08 |
Virginia City Project | 724 | 102 | 14.09 |
Utah City Project | 977 | 177 | 18.12 |
Boston City Project | 1563 | 301 | 19.26 |
Austin City Project | 3874 | 878 | 22.66 |
Sample Calculation:
Percentage of Infected Workers in Rochester City Project
\(= \frac{33}{328} \times 100 = 10.06\%\)
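The same calculation can be repeated for every row of the data table. The following is a minimal sketch; the dictionary layout and the function name `infected_percentage` are illustrative choices, and the figures are copied from the Total and Infected columns above.

```python
# (total employees, infected employees) for each plant, from the data table.
group_1 = {
    "Rochester City Project": (328, 33),
    "Chicago City Project": (348, 36),
    "San Diego City Project": (386, 42),
    "Newark City Project": (452, 72),
    "Texas City Project": (458, 53),
    "Dayton City Project": (673, 88),
    "Virginia City Project": (724, 102),
    "Utah City Project": (977, 177),
    "Boston City Project": (1563, 301),
    "Austin City Project": (3874, 878),
}

def infected_percentage(total, infected):
    """Share of infected employees, as a percentage rounded to 2 decimals."""
    return round(infected / total * 100, 2)

for name, (total, infected) in group_1.items():
    print(f"{name}: {infected_percentage(total, infected):.2f}%")
```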
Graphical Analysis:
The above graph represents the relationship between the number of employees aged between 30 and 45 who were diagnosed with cancer during their tenure of service and the total number of employees at different Nuclear Power Plants in the USA. The total number of employees working in each power plant, being the independent variable of the exploration, is plotted along the X-axis. The number of employees diagnosed with cancer out of the total working employees, being the dependent variable of the investigation, is plotted along the Y-axis. As the total number of employees working in a power plant increases from 328 to 3874, the number of individuals diagnosed with cancer increases from 33 to 878. Hence, an increasing linear trend has been obtained in the graph, i.e., with an increase in the number of workers in a power plant, the number of employees diagnosed with cancer increases. The equation of the trend line obtained in the graph is shown below:
y = 0.2386x - 54.366
Here, x represents the total number of employees working in a power plant, and y represents the number of those employees diagnosed with cancer.
Despite the very high value of the regression coefficient (0.99), the data set itself calls the reliability of the correlation into question, because there is a wide gap in the independent variable (the total number of employees working in the nuclear power plant) between about 1600 and 3800. As no values of the dependent variable are available for this range of the independent variable, the correlation cannot be said to be reliable.
Calculation of Regression Coefficient -
In the processed data table, the total number of employees working in a nuclear power plant is denoted by x, the number of employees diagnosed with cancer is denoted by y, and ∑ denotes summation.
x | y | \(x^2\) | \(y^2\) | xy |
---|---|---|---|---|
328 | 33 | 107584 | 1089 | 10824 |
348 | 36 | 121104 | 1296 | 12528 |
386 | 42 | 148996 | 1764 | 16212 |
452 | 72 | 204304 | 5184 | 32544 |
458 | 53 | 209764 | 2809 | 24274 |
673 | 88 | 452929 | 7744 | 59224 |
724 | 102 | 524176 | 10404 | 73848 |
977 | 177 | 954529 | 31329 | 172929 |
1563 | 301 | 2442969 | 90601 | 470463 |
3874 | 878 | 15007876 | 770884 | 3401372 |
Σx = 9783 | Σy = 1782 | \(Σx^2\) = 20174231 | \(Σy^2\) = 923104 | Σxy = 4274218 |
Figure 3 - Table On Processed Data For Calculation Of \(r^2\) For Group 1
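The column totals in the processed data table can be reproduced with a short script. This is only a sketch for checking the arithmetic; the variable names are illustrative, and the x and y values are copied from the table.

```python
# Total employees (x) and infected employees (y) for the ten plants.
xs = [328, 348, 386, 452, 458, 673, 724, 977, 1563, 3874]
ys = [33, 36, 42, 72, 53, 88, 102, 177, 301, 878]

sums = {
    "sum_x": sum(xs),
    "sum_y": sum(ys),
    "sum_x2": sum(x * x for x in xs),
    "sum_y2": sum(y * y for y in ys),
    "sum_xy": sum(x * y for x, y in zip(xs, ys)),
}

for label, value in sums.items():
    print(label, value)
```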
Calculation:
\(r^2=\bigg[\frac{n(Σxy)-(Σx)(Σy)}{\sqrt{[nΣx^2-(Σx)^2][nΣy^2-(Σy)^2]}}\bigg]^2\)
\(=>r^2=\bigg[\frac{10(4274218)-(9783)(1782)}{\sqrt{[10×20174231-(9783)^2][10×923104-(1782)^2]}}\bigg]^2\)
\(=> r^2 = (0.9988)^2 = 0.9976\)
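The substitution above involves large intermediate values that are easy to mistype on a calculator. As a minimal sketch (the variable names are illustrative), the numerator and the two bracketed terms can be evaluated separately from the column sums of the processed data table:

```python
from math import sqrt

n = 10
sum_x, sum_y = 9783, 1782
sum_x2, sum_y2, sum_xy = 20174231, 923104, 4274218

numerator = n * sum_xy - sum_x * sum_y   # 25308874
x_term = n * sum_x2 - sum_x ** 2         # 106035221
y_term = n * sum_y2 - sum_y ** 2         # 6055516

r = numerator / sqrt(x_term * y_term)
r2 = r ** 2
print(round(r, 4), round(r2, 4))
```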
Calculation of Pearson’s Correlation Coefficient -
In the processed data table, the total number of employees working in a nuclear power plant is denoted by x, the number of employees diagnosed with cancer is denoted by y, \(\bar x\) denotes the average number of workers employed in a nuclear power plant, \(\bar y\) denotes the average number of workers diagnosed with cancer, and ∑ denotes summation.
x | y | \(x-\bar x\) | \(y-\bar y\) | \((x-\bar x)(y-\bar y)\) | \((x-\bar x)^2\) | \((y-\bar y)^2\) |
---|---|---|---|---|---|---|
328 | 33 | -650.30 | -145.20 | 94423.56 | 422890.09 | 21083.04 |
348 | 36 | -630.30 | -142.20 | 89628.66 | 397278.09 | 20220.84 |
386 | 42 | -592.30 | -136.20 | 80671.26 | 350819.29 | 18550.44 |
452 | 72 | -526.30 | -106.20 | 55893.06 | 276991.69 | 11278.44 |
458 | 53 | -520.30 | -125.20 | 65141.56 | 270712.09 | 15675.04 |
673 | 88 | -305.30 | -90.20 | 27538.06 | 93208.09 | 8136.04 |
724 | 102 | -254.30 | -76.20 | 19377.66 | 64668.49 | 5806.44 |
977 | 177 | -1.30 | -1.20 | 1.56 | 1.69 | 1.44 |
1563 | 301 | 584.70 | 122.80 | 71801.16 | 341874.09 | 15079.84 |
3874 | 878 | 2895.70 | 699.80 | 2026410.86 | 8385078.49 | 489720.04 |
Calculation -
\(\bar x=\frac{Σx}{N}=\frac{9783}{10}=978.3\)
\(\bar y=\frac{Σy}{N}=\frac{1782}{10}=178.2\)
\(Σ(x-\bar x)(y-\bar y)=2530887.40\)
\(Σ(x-\bar x)^2=10603522.10\)
\(Σ(y-\bar y)^2=605551.60\)
\(R=\frac{Σ(x-\bar x)(y-\bar y)}{\sqrt{Σ(x-\bar x)^2×Σ(y-\bar y)^2}}\)
\(R=\frac{2530887.40}{\sqrt{10603522.10×605551.60}}=0.9988\)
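As a hedged sketch of the deviation-based calculation, the three sums and R can be recomputed from the Group 1 data. The function names `deviation_sums` and `pearson_r` are illustrative choices and do not come from any particular library.

```python
from math import sqrt

xs = [328, 348, 386, 452, 458, 673, 724, 977, 1563, 3874]
ys = [33, 36, 42, 72, 53, 88, 102, 177, 301, 878]

def deviation_sums(xs, ys):
    """Return the sums of (x-mean_x)(y-mean_y), (x-mean_x)^2 and (y-mean_y)^2."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy, s_xx, s_yy

def pearson_r(xs, ys):
    """Pearson's correlation coefficient in its deviation form."""
    s_xy, s_xx, s_yy = deviation_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)

s_xy, s_xx, s_yy = deviation_sums(xs, ys)
print(round(s_xy, 2), round(s_xx, 2), round(s_yy, 2))
print(round(pearson_r(xs, ys), 4))
```

Unlike \(r^2\), the value returned here keeps its sign, so a decreasing relationship would give a negative result.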
Evaluation by T – Test -
In the calculation shown below, the total number of employees working in a nuclear power plant is denoted by x, and the number of employees diagnosed with cancer is denoted by y; \(\bar x\) denotes the average number of workers employed in a nuclear power plant, \(\bar y\) denotes the average number of workers diagnosed with cancer, \(n_x\) represents the number of observations of the total number of working employees (independent variable), \(n_y\) represents the number of observations of employees diagnosed with cancer (dependent variable), and \(S^2\) is an estimator of the pooled variance, which is defined as follows:
\(S^2=\frac{Σ(x-\bar x)^2+Σ(y-\bar y)^2}{n_x+n_y-2}\)
The mathematical formulation of the T – Value is also shown below:
\(T\ value=\frac{|\bar x-\bar y|}{\sqrt{\frac{S^2}{n_x}+\frac{S^2}{n_y}}}\)
For the calculation of the T – Value required for this test, Table 1 has been followed:
\(\bar x=\frac{9783}{10}=978.3\)
\(\bar y=\frac{1782}{10}=178.2\)
\(S^2=\frac{Σ(x-\bar x)^2+Σ(y-\bar y)^2}{n_x+n_y-2}\)
\(=\frac{(328-978.3)^2+...+(3874-978.3)^2+(33-178.2)^2+...+(878-178.2)^2}{10+10-2}\)
\(=\frac{10603522.10+605551.60}{18}=622726.32\)
\(T\ value=\frac{|978.3-178.2|}{\sqrt{\frac{622726.32}{10}+\frac{622726.32}{10}}}=\frac{800.1}{352.91}=2.27\)
On comparing the T – Value with the critical values in the T – Table, it can be stated that the Alternate Hypothesis is accepted.
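The steps of the T – Value calculation can be sketched in code. This minimal illustration assumes the standard pooled-variance form, in which the squared deviations of each sample are taken about that sample's own mean. The function name `pooled_t_value` and the sample lists are hypothetical and are not taken from the data of this exploration.

```python
from math import sqrt

def pooled_t_value(xs, ys):
    """Two-sample T value using a pooled variance estimate."""
    n_x, n_y = len(xs), len(ys)
    mean_x = sum(xs) / n_x
    mean_y = sum(ys) / n_y
    # Squared deviations of each sample about its own mean.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    pooled_variance = (ss_x + ss_y) / (n_x + n_y - 2)
    standard_error = sqrt(pooled_variance / n_x + pooled_variance / n_y)
    return abs(mean_x - mean_y) / standard_error

# Hypothetical samples used only to exercise the function.
print(round(pooled_t_value([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]), 4))
```

The resulting value is then compared with the critical value from the T – Table for \(n_x+n_y-2\) degrees of freedom at the chosen significance level.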
Data Table:
Name | Total | Infected | Percentage |
---|---|---|---|
Rochester City Project | 333 | 36 | 10.81 |
Chicago City Project | 344 | 38 | 11.05 |
San Diego City Project | 378 | 57 | 15.08 |
Newark City Project | 462 | 99 | 21.43 |
Texas City Project | 486 | 102 | 20.99 |
Dayton City Project | 620 | 114 | 18.39 |
Virginia City Project | 797 | 144 | 18.07 |
Utah City Project | 971 | 160 | 16.48 |
Boston City Project | 1497 | 297 | 19.84 |
Austin City Project | 3388 | 790 | 23.32 |
Sample Calculation:
Refer to the Sample Calculation shown for Table No. 1.
Graphical Analysis: