This investigation explores the investment strategy of diversification, a key process in risk management that involves building a portfolio with a variety of investments in order to manage its volatility. Essentially, it is a concept that reminds investors to never put all their eggs (investments) in one basket (Palmer 2022) in order to maximize protection against potential financial disruptions and undesirable market conditions.
Since becoming aware of the possibilities of accumulating wealth in the long term from investing at a young age, I have been looking into stock investment and am considering creating an equity portfolio as a way to set aside funds for the business that I want to start in the future, whilst gaining financial literacy skills that will benefit me later on. However, given my limited knowledge and financial resources, I knew that it wasn’t reasonable for me to create a highly diversified equity portfolio containing a wide variety of stocks. Hence, I became curious about how many stocks a portfolio has to contain so that it sits at a tolerable risk level but still generates satisfactory returns.
The aim of this exploration is to determine the optimal number of stocks that will minimize the volatility of an equity portfolio for the Hong Kong stock market through modelling the relationship between the number of stocks in a portfolio and its overall risk.
To approach this investigation, I will compile equity portfolios composed of varying numbers of stocks and calculate their monthly returns and the standard deviation of those returns (portfolio risk). A scatter plot will then be created using Desmos to show the general relationship between the number of stocks that a portfolio contains and its risk. Based on the pattern observed, I will first select an appropriate modelling function and attempt to find its unknown parameters using algebra, then use technology to verify the generated function. Afterwards, I will calculate the percentage errors of both equations to determine how well they model the data.
An equity portfolio can be described as a basket of stocks that an individual invests in with the expectation of generating returns in the long run. In this investigation, each equity portfolio consists of equally weighted stocks that are rebalanced on a monthly basis so as to maintain their equal weights. For example, in the portfolio that contains 10 stocks, each stock is assigned a weight of 10%.
Portfolio return refers to the gains or losses generated by an investment portfolio over a given period of time. As it is the weighted average of stock return from each individual component within the portfolio, it can be calculated by multiplying each investment’s return and weight, and adding those values together.
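As a minimal sketch of the weighted-average calculation described above, the Python function below multiplies each stock's return by its weight and adds the results. The function name and the three sample returns are illustrative assumptions and are not data from this investigation.

```python
def portfolio_return(returns, weights):
    """Weighted average of individual stock returns (all in the same units, e.g. %)."""
    if len(returns) != len(weights):
        raise ValueError("returns and weights must have the same length")
    return sum(r * w for r, w in zip(returns, weights))


# Hypothetical monthly returns (%) for three equally weighted stocks.
sample_returns = [2.0, -1.0, 5.0]
equal_weights = [1 / len(sample_returns)] * len(sample_returns)
example = portfolio_return(sample_returns, equal_weights)
print(round(example, 2))  # 2.0
```

With equal weights this reduces to the simple average of the stock returns, which is how the portfolio returns in this investigation are described.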
Risk, otherwise known as ‘volatility’, is measured by the standard deviation of a variable as a percentage of the mean (Marrison 2002) and is often expressed as an annual rate. The higher the standard deviation, the more the portfolio return will fluctuate. Whilst each investment within a portfolio carries its own risk, portfolio risk refers to the overall risk for all investments within that portfolio.
The Hang Seng Index (HSI) is a capitalization-weighted stock market index based in Hong Kong which is used to monitor changes in the largest and most liquid companies which trade via the Hong Kong Exchange on a day-to-day basis. Currently consisting of 66 constituents, it is widely accepted as the primary indicator of the city’s market performance.
To begin the investigation, I filtered through the HSI to find the constituents that had a price history of at least 36 months (May 2019 to May 2022). This time period was chosen because, according to the Central Limit Theorem, a theory stating that as a sample size gets larger the distribution of sample means approximates a normal distribution (Ganti 2022), a sample size equal to or greater than thirty is generally considered sufficient for this approximation to hold. A normal distribution occurs when the values are distributed evenly above and below the mean and there are no extreme values present. Consequently, the continuous probability distribution will be symmetric around the mean and most observations will be concentrated around the central peak (Frost), thus allowing the distribution of values to be accurately described. Additionally, the opening and closing dates were selected so that the findings of this investigation would be as relevant as possible.
After downloading the available information for these constituents from the Yahoo Finance website, I first had to calculate the monthly return for each stock in order to calculate the overall portfolio return (Saba 1:13). However, I quickly realised that one stock had a monthly return that differed substantially from the others. Upon inspection, I discovered that the abnormal return was due to the stock going ex-dividend in that month; since the stock was then traded without the value of the next dividend payment, a negative return was calculated. Although a difference in returns was not a problem in itself, my use of prices that did not reflect dividends made the returns inaccurate, which would have led to a flawed depiction of market movement. To overcome this issue, I went back to the website and downloaded the prices that had been adjusted for dividends instead.
The formula used to calculate the monthly return for each stock over the aforementioned time frame is shown below, where M1 represents the dividend-adjusted price at the end of Month 1 and M2 represents the dividend-adjusted price at the end of Month 2 -
\(\text{Monthly Return} = \frac{M2 - M1}{M1} \times 100\%\)
For example, the monthly return of March, 2019 for Stock 2319.HK (see Appendix A for this constituent’s historical prices) was calculated as follows -
\(\text{Monthly Return} = \frac{28.597612 - 23.749729}{23.749729} \times 100\%\)
\(\text{Monthly Return} = 20.41\%\)
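The same calculation can be sketched in Python. The function name is an illustrative choice, and the two prices are those quoted in the worked example above.

```python
def monthly_return(price_start, price_end):
    """Percentage change between two dividend-adjusted month-end prices."""
    return (price_end - price_start) / price_start * 100


# Prices quoted in the worked example for Stock 2319.HK.
result = monthly_return(23.749729, 28.597612)
print(f"{result:.2f}%")  # 20.41%
```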
After all stock monthly returns were calculated, I had sufficient data to start building the portfolios. I decided to create 50 portfolios (with each portfolio containing two more stocks than the last to reveal a clear trend and pattern) based on convenience and the knowledge that the sample size had to be greater than thirty as stated previously. To do so, I used the method of random sampling and assigned a number from 1 to 64 (the number of stocks in the HSI with an available price history) to each stock. I could then repeatedly insert the formula = ROUND(RAND() ∗ 64,0) into Excel so that it would generate a portfolio containing my desired number of stocks. This method was intended to give each stock an equal probability of being selected in order to reduce bias. The most significant limitation of this method of random sampling is that Excel would sometimes generate the same number (stock) more than once in a portfolio, as shown in Figure 1 where Stock 42 appears twice and Stock 47 appears three times. Considering that the portfolios are supposed to be equally weighted, having duplicated stocks would result in some being double, or even triple, weighted and would not provide realistic results. However, as my Excel skills were not sophisticated enough to set up a formula that avoided this problem, I had to check the samples generated for each portfolio by hand, which was extremely time-consuming.
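The duplicate problem described above can be avoided by sampling without replacement. The sketch below is not the method used in this investigation; it is a Python alternative using the standard library's `random.sample`, with a hypothetical function name and an arbitrary seed. It also sidesteps a subtlety of ROUND(RAND() ∗ 64,0): that formula can return 0, and it returns 0 and 64 only half as often as the numbers in between.

```python
import random


def draw_portfolio(n_stocks, universe_size=64, seed=None):
    """Pick n_stocks distinct stock numbers from 1..universe_size, each equally likely."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, universe_size + 1), n_stocks))


# The seed is arbitrary and only makes the draw repeatable.
portfolio = draw_portfolio(10, seed=1)
print(portfolio)  # ten distinct stock numbers between 1 and 64
```

Because every draw is made from the stocks not yet chosen, no manual check for repeated stocks is needed.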
Once the portfolios were generated, in order to calculate the risk, I first had to calculate the monthly returns for each portfolio over the 36-month period using the aforementioned formula, then calculate the standard deviation of the monthly returns. Standard deviation is an indicator of volatility because it measures how dispersed the data are from the mean, which parallels how far the price of an asset spreads from its average price (Fidelity). There are two types of standard deviation: population and sample. Although both measure the spread of values within a set, the population standard deviation is a parameter calculated from every individual in the entire population, whereas the sample standard deviation is calculated from only some of the individuals in the population and is relevant when the values are a sample of a larger population (Taylor 2019). As each portfolio I had generated contained monthly returns from May 2019 to May 2022 only, and is therefore a sample of the entire return history of its constituents, the sample standard deviation formula is the more relevant one.
Sample standard deviation can be calculated using the following formula, where σ represents the sample standard deviation (i.e. volatility), n represents the number of monthly returns for the portfolio, \(x_i\) represents the portfolio return for month i, and \(\bar{x}\) represents the mean of the monthly returns over the 36 months.
\(\sigma = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\)
Although this formula was used to calculate the standard deviation for all the portfolios, I did not think that it would be reasonable to calculate it by hand since I had too much data. Instead, I broke the formula down into a series of smaller steps, for each of which I set up a formula in Excel (Bhandari 2020). This method was used to calculate the standard deviation and risk for all portfolios. As a sample calculation, I have illustrated how I calculated the risk for the portfolio containing two stocks (see Appendix B for this portfolio) in Figure 2.
A - The values in this column are the equally weighted average returns of the constituent stocks in the portfolio. For each month, they were calculated by adding the monthly returns of every stock and dividing the total by the number of stocks the portfolio contains.
B - The mean was calculated by first finding the sum of all returns, then dividing it by 36 (the sample size).
\(\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}\)
C - For each individual value (1-36), the mean was subtracted and the result was squared.
\((x_i-\bar{x})^2\)
D - The sum (Σ) of all values in the previous column (from i = 1 to n, where n = 36) was then calculated.
\(\sum_{i=1}^{n}(x_i-\bar{x})^2\)
E - This is the standard deviation of the portfolio expressed as a monthly figure. The value from the previous column was divided by 35, because 1 had to be subtracted from n, and the square root of the result was taken.
F - This is the final portfolio risk, which was calculated by annualising the standard deviation. As the standard deviation was calculated based on monthly returns, it was multiplied by the square root of 12.
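Steps B to F can be sketched in Python as follows. This is an illustration of the same arithmetic rather than the Excel sheet itself. The sample returns are hypothetical and are not the Appendix B data. The final line checks the result against the standard library's `statistics.stdev`, which also uses the n − 1 divisor.

```python
import math
import statistics


def annualised_risk(monthly_returns):
    """Sample standard deviation of monthly returns, scaled by sqrt(12)."""
    n = len(monthly_returns)
    mean = sum(monthly_returns) / n                            # step B
    squared_devs = [(x - mean) ** 2 for x in monthly_returns]  # step C
    total = sum(squared_devs)                                  # step D
    monthly_sd = math.sqrt(total / (n - 1))                    # step E
    return monthly_sd * math.sqrt(12)                          # step F


# Hypothetical monthly portfolio returns (%), for illustration only.
sample = [1.0, -2.0, 3.0, 0.5, -1.5, 2.0]
risk = annualised_risk(sample)
assert math.isclose(risk, statistics.stdev(sample) * math.sqrt(12))
```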
After calculating the portfolio risk for all portfolios, I created the following table of values (Figure 3) where x is defined as the number of stocks that a portfolio is made up of and y is defined as the portfolio risk (expressed as a percentage).
To observe the relationship between the x and y values, I then inserted the data points into the mathematics software ‘Desmos’ – the plotted graph produced can be seen in Figure 4.
Based on Figure 4, the following observations can be made -
It is an exponential function - this appears to be the best mathematical model to represent the graph, based on the observations below. The parent function is \(y = e^x\), which with transformations may be expressed in the form \(y = ae^{bx} + c\), in which the function is vertically stretched by a factor of a, horizontally stretched by a factor of \(\frac{1}{b}\), and vertically translated by c units.
The function represents a decay - the trend falls at a decreasing rate as x gets larger. From this, it can be inferred that the coefficient of x is negative, which makes the function \(y = ae^{-bx} + c\) where b > 0.
The vertical stretch factor is positive - because all the points lie above the horizontal asymptote, there has not been a reflection in the x-axis from the parent function. This means that a is a positive number.
The horizontal asymptote is 17 - as x increases, y can be seen approaching but never touching 17. Therefore, it can be inferred that c, the parameter which denotes the vertical translation, is equal to 17.
In accordance with the fourth observation, it can be hypothesized that the equation used to model the data in Figure 4 will be \(y = ae^{-bx} + 17\).
To find the parameters a and b, I will create two equations by substituting two randomly selected points from Figure 3 into the equation. I will then solve the resulting system of equations by the method of substitution; these calculations can be seen below.
Coordinates to be substituted into Equation 1 - (10, 24.35)
Coordinates to be substituted into Equation 2 - (32, 17.70)
Equation 1 -
\(24.35 = ae^{-10b} + 17\)
\(ae^{-10b} = 7.35\)
\(e^{-10b} = \frac{7.35}{a}\)
Equation 2 -
\(17.70 = ae^{-32b} + 17\)
\(ae^{-32b} = 0.70\)
\(e^{-32b} = \frac{0.70}{a}\)
Substituting Equation 2 into Equation 1 -
\(e^{-10b} = \left(e^{-32b}\right)^{\frac{10}{32}}\)
\(\frac{7.35}{a} = \left(\frac{0.70}{a}\right)^{\frac{10}{32}}\)
\(\frac{7.35}{a} = \frac{0.894526}{a^{\frac{10}{32}}}\)
\(a^{-\frac{22}{32}} = \frac{0.894526}{7.35}\)
\(a = 21.4025\)
\(-10b = \ln\left(\frac{7.35}{21.4025}\right)\)
\(b = \frac{\ln\left(\frac{7.35}{21.4025}\right)}{-10}\)
\(b = 0.106881\)
\(\therefore a = 21.40, \; b = 0.107\)
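The values of a and b can be cross-checked with a short Python sketch. Rather than repeating the substitution steps, it uses the equivalent shortcut of dividing one equation by the other to eliminate a. The function name is hypothetical and c = 17 is taken from the asymptote observed earlier.

```python
import math


def fit_two_points(p1, p2, c=17.0):
    """Solve y = a*exp(-b*x) + c for a and b, given two points and a fixed c."""
    (x1, y1), (x2, y2) = p1, p2
    # Dividing the two equations eliminates a:
    # (y1 - c) / (y2 - c) = exp(b * (x2 - x1))
    b = math.log((y1 - c) / (y2 - c)) / (x2 - x1)
    a = (y1 - c) * math.exp(b * x1)
    return a, b


a, b = fit_two_points((10, 24.35), (32, 17.70))
print(round(a, 2), round(b, 3))  # 21.4 0.107
```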
Substituting the values of a and b into the original function produces the equation \(y = 21.40e^{-0.107x} + 17\). This function is modelled by the purple curve as shown below in Figure 5.
In order for this graph to make sense in the context of equity portfolios, the domain needs to be x > 0, x ∈ R since it is not possible for an equity portfolio to contain zero, let alone a negative number of, stocks. As mentioned previously, because the horizontal asymptote is 17, the range of the function is y > 17, y ∈ R. It is also worth mentioning that the graph is unrealistic in the sense that the curve is continuous and suggests that adding stocks is always beneficial, whereas a portfolio containing too many stocks can become overdiversified, which may take away the impact of significant stock gains.
Using the functions available on Excel, I was able to generate the following equation - \(y = 28.651e^{-0.131x} + 17\). This function is modelled by the red curve as shown below in Figure 6. For this graph, the domain and range should also be x > 0, x ∈ R and y > 17, y ∈ R respectively.
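The text does not state which Excel feature produced this equation. One plausible route, assumed here purely for illustration, is to fix c = 17, take ln(y − 17) = ln a − bx, and fit a straight line by least squares. The Python sketch below implements that idea with the standard library only. The data are synthetic, generated from known parameters to check the routine, and are not the Figure 3 values.

```python
import math


def fit_exponential_decay(xs, ys, c=17.0):
    """Least-squares line through (x, ln(y - c)); returns (a, b) for y = a*exp(-b*x) + c."""
    zs = [math.log(y - c) for y in ys]  # only valid when every y is greater than c
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic check data generated from a = 30 and b = 0.12 (hypothetical values).
xs = list(range(2, 52, 2))
ys = [30.0 * math.exp(-0.12 * x) + 17.0 for x in xs]
a_fit, b_fit = fit_exponential_decay(xs, ys)
```

Because this approach uses every data point rather than two selected ones, it would be expected to follow the overall trend more closely than the algebraic fit.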
For comparison purposes, the functions derived algebraically and through the use of technology have been graphed on the same diagram as shown below in Figure 7.
In order to determine which of the generated equations better fits the actual trend of the data in Figure 3, the difference between each actual portfolio risk and the value predicted by each function must be calculated. Expressed as a percentage, this discrepancy is known as the percentage error.
To calculate the percentage error, the formula shown below is used, where \(v_A\) is the true portfolio risk (the y coordinates from Figure 3) and \(v_E\) is the portfolio risk obtained by substituting the x coordinates into the generated equations.
\(\text{Percentage Error (\%)} = \frac{\left| v_A - v_E \right|}{v_E} \times 100\)
This formula was used to calculate the percentage error of each portfolio modelled by the different equations; the results can be seen in Appendix C. As a sample calculation, I will demonstrate how I calculated the percentage error for the portfolio containing 50 stocks from the algebraically derived model (\(y = 21.40e^{-0.107x} + 17\)).
\(\text{Percentage Error} = \frac{\left| 17.06 - 17.10 \right|}{17.10} \times 100\)
\(\text{Percentage Error} = \frac{\left| -0.04 \right|}{17.10} \times 100\)
\(\text{Percentage Error} = 0.002339 \times 100\)
\(\text{Percentage Error} = 0.23\%\)
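The sample calculation can be reproduced with the Python sketch below. The function names are illustrative. The model value is rounded to two decimal places before the error is taken, as in the worked example; without that rounding the result would be about 0.24%. The helper for the mean is included only to show the averaging step, since the Appendix C data are not reproduced here.

```python
import math


def model(x, a, b, c=17.0):
    """Exponential decay model y = a*exp(-b*x) + c."""
    return a * math.exp(-b * x) + c


def percentage_error(actual, estimated):
    """Absolute difference as a percentage of the estimated value."""
    return abs(actual - estimated) / estimated * 100


def mean_percentage_error(points, a, b):
    """Average percentage error over (x, actual_y) pairs."""
    errors = [percentage_error(y, model(x, a, b)) for x, y in points]
    return sum(errors) / len(errors)


estimate = round(model(50, 21.40, 0.107), 2)  # 17.1, rounded as in the worked example
error = percentage_error(17.06, estimate)
print(round(error, 2))  # 0.23
```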
After the percentage errors for each portfolio were calculated, the mean of all percentage errors was found to see if the overall function accurately modelled the relationship between the number of stocks in the portfolio and the portfolio risk. This was repeated for the equation derived from technology as well (\(y = 28.651e^{-0.131x} + 17\)).
Mean of percentage errors for \(y = 21.40e^{-0.107x} + 17\) = 0.68%
Mean of percentage errors for \(y = 28.651e^{-0.131x} + 17\) = 0.33%
If the estimated or modelled data points prove to be a good fit for the actual data points, the percentage error should be as close to zero as possible. Based on the mean of the percentage errors, it can be observed that the function generated using technology is more accurate than the one generated algebraically in modelling the pattern; the two values differ by 0.35 percentage points. Whilst 0.35 may not seem like a large difference, it may be of great significance when discussing the risk of a portfolio. Therefore, although the equation \(y = 21.40e^{-0.107x} + 17\) may be able to model the general relationship, it lacks accuracy, which makes it less relevant compared to \(y = 28.651e^{-0.131x} + 17\). Nonetheless, using the mean of percentage errors as a measure has the disadvantage of being biased when the values in the denominator are close to zero, as the corresponding percentage errors will be exceptionally high. Similarly, when such a value is exactly zero, the mean of percentage errors will be undefined altogether.
Overall, it can be concluded that the higher the number of stocks in the portfolio, the lower the portfolio’s volatility until a plateau is reached. In the case of this investigation, a noticeable decrease in risk as stocks are added is only present up to the point where a portfolio contains 28 stocks. From 28 onwards, regardless of how many more stocks are added to the portfolio, the degree to which the risk changes is minimal and it remains at around 17%. This implies that to obtain the lowest portfolio volatility possible, one’s portfolio must consist of at least 28 stocks, which aligns with research advising investors to diversify their portfolios so that they hold 20 to 30 stocks (Swenson 2022). This is important to understand in order to avoid compiling a bloated portfolio which will no longer reap the benefits of diversification.
Despite the insight gained from this investigation, it has limitations which may limit how meaningful it is. For example, this study is only beneficial to those who equally weight their equity portfolio, since the portfolios were generated with the assumption that all stocks carry the same weight. Though equally weighted portfolios have historically produced returns which are higher whilst keeping risk at a minimum, many investors still choose to assign a weight to each stock based on its value.
Personally speaking, the most valuable takeaway from the investigation is the importance of a diverse portfolio and the finding that an equity portfolio must contain at least 28 stocks to minimise its risk. Another observation that resonated with me was how small the increments were by which the portfolio risk decreased with the addition of more stocks. As a high school student with limited savings and no source of income, it might not be worth my investing in that many stocks at this stage since the changes to the volatility will be minimal either way. Furthermore, it is likely that with each stock added to the portfolio, a great deal of extra time and effort will be required to monitor its trends and fluctuations, which some people might not be able to afford; this can be considered a downside of having many stocks in one equity portfolio.
Beers, Brian. "How Is Standard Deviation Used to Determine Risk?" Investopedia, 18 Apr. 2022, www.investopedia.com/ask/answers/021915/how-standard-deviation-used-determine-risk.asp. Accessed 18 June 2022.
Bhandari, Pritha. "How to Calculate Standard Deviation (Guide) | Formulas & Examples." Scribbr, 17 Sept. 2020, www.scribbr.com/statistics/standard-deviation/. Accessed 3 Jan. 2023.
Chen, James. "Portfolio Return." Investopedia, 19 July 2020, www.investopedia.com/terms/p/portfolio-return.asp. Accessed 18 June 2022.
"Diversification Ratio for Portfolio Management (Excel)." YouTube, uploaded by NEDL and Saba, 24 May 2021, www.youtube.com/watch?v=3vzPnRMqJiw. Accessed 3 Jan. 2023.
Fidelity. "Standard Deviation." Fidelity, www.fidelity.com/learning-center/trading-investing/technical-analysis/technical-indicator-guide/standard-deviation#:~:text=Standard%20deviation%20is%20the%20statistical,value%20that%20indicates%20low%20volatility. Accessed 3 Jan. 2023.
Fischer, Jan. "What the Mape Is FALSELY Blamed For, Its TRUE Weaknesses and BETTER Alternatives!" statworx, 16 Aug. 2019, www.statworx.com/en/content-hub/blog/what-the-mape-is-falsely-blamed-for-its-true-weaknesses-and-better-alternatives/. Accessed 3 Jan. 2023.
Frost, Jim. "Normal Distribution in Statistics." Statistics By Jim, statisticsbyjim.com/basics/normal-distribution/. Accessed 3 Jan. 2023.
Ganti, Akhilesh. "Central Limit Theorem (CLT): Definition and Key Characteristics." Investopedia, June 2022, www.investopedia.com/terms/c/central_limit_theorem.asp#:~:text=Key%20Takeaways-,The%20central%20limit%20theorem%20(CLT)%20states%20that%20the%20distribution%20of,for%20the%20CLT%20to%20hold. Accessed 3 Jan. 2023.
Hang Seng Indexes. "Hang Seng Indexes Announces Index Review Results." Hang Seng Indexes, 20 May 2022, www.hsi.com.hk/static/uploads/contents/en/news/pressRelease/20220520T000000.pdf. Accessed 18 June 2022.
Lewis, Nigel Da Costa. Market Risk Modelling: Applied Statistical Methods for Practitioners. London, Risk Books, 2003.
Marrison, Chris. The Fundamentals of Risk Measurement. New York City, McGraw-Hill, 2002.
Palmer, Barclay. "5 Tips for Diversifying Your Portfolio." Investopedia, 16 Mar. 2022, www.investopedia.com/articles/03/072303.asp. Accessed 18 June 2022.
Swenson, Sam. "How Many Stocks Should You Own?" The Motley Fool, 29 June 2022, www.fool.com/investing/how-to-invest/stocks/how-many-stocks-should-i-own/. Accessed 3 Jan. 2023.
Taylor, Courtney. "Differences Between Population and Sample Standard Deviations." ThoughtCo., 23 Jan. 2019, www.thoughtco.com/population-vs-sample-standard-deviations-3126372#:~:text=The%20population%20standard%20deviation%20is,the%20individuals%20in%20a%20population. Accessed 3 Jan. 2023.