Why the SP500 is a better store of value than your bank (Part I) 💸

The current economic instability we are going through makes us think more often about our future, and whether we should save money or invest part of it.

A large percentage of the world’s population is reluctant to invest part of their wealth, keeping it instead in bank accounts. For instance, in the United States, someone who had kept their money in a bank account for the last 30 years would have roughly half the purchasing power today.

But if there were a financial instrument offering annual returns capable of maintaining, or even increasing, our purchasing power over time, with a very low probability of failure, would we deposit our savings, or part of them, in it? What instruments with these characteristics exist? And how likely are they to lose us money?

In this article we will use mathematics to try to answer these and many other questions.

SP500 index funds

Index funds are passively managed collective investment vehicles. These financial instruments attempt to replicate a given index (Euro Stoxx 50, Dow Jones, SP 500…) by buying shares of each company in the index in proportion to its weight in that index.

The Standard & Poor’s 500 index (SP500) is one of the most important stock market indexes in the United States, and it is widely considered the most representative of the real market situation. The SP500 is based on the market capitalisation of 500 large companies listed on the NYSE or NASDAQ stock exchanges.

In this article we will simulate the future behaviour of the SP500 over a given period of time, answering the questions above. To do so, we will use several mathematical tools, together with the programming languages Python and R.

Downloading the historical data

In a Python environment, we start by downloading the SP500 history with the following piece of code. We will not focus on how the libraries used work, as there are plenty of tutorials about them online.

Once the historical data has been downloaded, we do some quick processing to store the date column as integers and to calculate the relative and absolute increments of the daily closes.
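As an illustration of this step, on a toy dataframe standing in for the downloaded history (column names are assumptions):

```python
import pandas as pd

# Toy stand-in for the downloaded SP500 history
sp500 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
    "Close": [3257.85, 3234.85, 3246.28],
})

# Store the date as an integer (days since the Unix epoch)
sp500["Date"] = (sp500["Date"] - pd.Timestamp("1970-01-01")).dt.days

# Absolute and relative increments of the daily closes
sp500["abs_inc"] = sp500["Close"].diff()
sp500["rel_inc"] = sp500["Close"] / sp500["Close"].shift(1)
```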

As we want to study the return on the capital deposited in SP500 index funds, which pay out the dividends corresponding to our SP500 shares, we must adjust the growth of the index by the dividend yield. If we know the annual dividend yield of the SP500, we can estimate the daily dividend yield and multiply it by the daily increase of the index, obtaining the daily increase of our capital in the index fund.
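A sketch of that adjustment (the 1.8% annual yield and the 252-trading-day year are illustrative assumptions, not values from the article):

```python
# Assumption: convert an annual dividend yield into a daily one over
# 252 trading days, then fold it into a daily relative increment.
annual_div_yield = 0.018
daily_div_yield = (1 + annual_div_yield) ** (1 / 252) - 1

rel_inc = 1.0004                          # one daily relative increment of the index
rel_inc_with_div = rel_inc * (1 + daily_div_yield)
```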

The data stored in annual_div.csv has been extracted from here.

Finally, we save the updated SP500 dataframe with the dividend yield in a csv file.

Thus, our Python script to obtain the daily closes of the SP500, adjusted for the dividend yield, would look like this:

Simulating the SP500 behaviour

Once we have stored the data we need, we move to an R programming environment to perform the SP500 simulations. We create a script, called simulations.R, which contains the function to perform the simulations of the index.

We start by creating the function definition, which receives as main arguments the dataframe with the historical data, the number of years to simulate and the number of simulations to carry out, as well as an optional parameter for the graphical representation of the simulations, which we will detail later.

Once the function is defined, we simulate the behaviour of the closing prices of the SP500 adjusted for dividends. To simulate the behaviour of the index over n years, we pick a random starting day from among all possible ones (all available days except those in the last n years). Then, with the starting day fixed, we take a sample with replacement, of size n years of trading days, from the relative increments of the n years following the starting day. This sample represents one simulation of the dividend-adjusted growth of the SP500 over n years. Next, we compute the cumulative yield as the product of the simulated relative increments and store it, annualised, in the annual.yields vector. Finally, we repeat this process as many times as we want:
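One iteration of that sampling scheme could be sketched in isolation as follows (all names are assumptions, a constant vector stands in for the real increments, and 252 trading days per year is assumed):

```r
# rel.inc: dividend-adjusted daily relative increments (illustrative constant)
rel.inc <- rep(1.0003, 6000)
n.years <- 2
n.days <- n.years * 252

# Random starting day, leaving n.years of history after it
start <- sample(1:(length(rel.inc) - n.days), 1)

# Sample WITH replacement from the n years following the starting day
incs <- sample(rel.inc[start:(start + n.days - 1)], n.days, replace = TRUE)

# Cumulative yield over the horizon, annualised
annual.yield <- prod(incs)^(1 / n.years)
```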

The last step is to return the vector with the annual returns obtained in the simulations.

In addition, using the optional parameter plot, we include the possibility of obtaining a plot of the simulations. To do this, in the previous loop we store each simulation in a column of a previously initialised matrix and, once the simulated increments have been obtained, we plot them together with their logarithms, so that the different behaviours obtained in the simulations can be seen more clearly.

In this way, our simulations.R script with the simulate function looks like:
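The original listing is not reproduced here; a hedged reconstruction under the assumptions above (252 trading days per year, a rel_inc column produced by the Python script, and invented variable names) might be:

```r
# simulations.R — sketch, not the article's exact code
simulate <- function(data, n.years, n.sim, plot = FALSE) {
  days.per.year <- 252                      # assumption
  n.days <- n.years * days.per.year
  rel.inc <- data$rel_inc[!is.na(data$rel_inc)]

  annual.yields <- numeric(n.sim)
  paths <- matrix(NA_real_, nrow = n.days, ncol = n.sim)

  for (i in 1:n.sim) {
    # Random starting day, then a with-replacement sample of the
    # relative increments of the n years that follow it
    start <- sample(1:(length(rel.inc) - n.days), 1)
    incs <- sample(rel.inc[start:(start + n.days - 1)], n.days, replace = TRUE)

    paths[, i] <- cumprod(incs)             # simulated growth path
    annual.yields[i] <- paths[n.days, i]^(1 / n.years)
  }

  if (plot) {
    # Simulated paths and their logarithms
    matplot(paths, type = "l", lty = 1, xlab = "Day", ylab = "Growth")
    matplot(log(paths), type = "l", lty = 1, xlab = "Day", ylab = "log-growth")
  }

  annual.yields
}
```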


Now we create a second script, SP500_study.R, which calls the simulation function created earlier and stores the returns obtained in the simulations. To do this, we first load the simulations.R script. Next, we load the historical dataframe stored in SP500.csv. We declare the number of years we want to simulate, in this case 20, and the number of simulations to perform, in this case 1000. Finally, we call the simulate function, this time plotting the simulations to get an idea of the general behaviour of the index.
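A sketch of that script (it assumes the SP500.csv and simulations.R files produced in the previous steps, and a simulate() signature that is itself an assumption):

```r
# SP500_study.R — sketch, not the article's exact code
source("simulations.R")

SP500 <- read.csv("SP500.csv")

n.years <- 20      # years to simulate
n.sim <- 1000      # number of simulations

annual.yields <- simulate(SP500, n.years, n.sim, plot = TRUE)
```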

A final detail to simulate is the fees of the index fund through which we invest. These fees are extremely low compared with other types of financial instruments (for example, the Vanguard U.S. 500 Stock Index Fund charges 0.10% annually and the Fidelity SP 500 Index Fund 0.06%). For this reason they have little impact on performance, but they should still be taken into account. We do this through the following line of code, in which we assume annual fees of 0.1%:
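One possible form of that line, shown on an illustrative vector (the gross annual factors are made up; fees are taken from the whole capital each year):

```r
annual.yields <- c(1.10, 1.05, 0.98)   # illustrative gross annual factors
# Discount annual fees of 0.1% from the gross annual yields
annual.yields <- annual.yields * (1 - 0.001)
</r>
```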

If we plot the vector of average annual returns, once fees have been discounted, in a histogram, we can see that they cluster around 1.10, i.e. 10% per year, and that there are very few cases in which the annual return falls below 0%.
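The histogram could be produced with something like the following (a normal sample stands in for the real simulation output, purely for illustration):

```r
# Illustrative histogram of simulated annual return factors
set.seed(3)
annual.yields <- rnorm(1000, mean = 1.10, sd = 0.05)  # stand-in data

h <- hist(annual.yields, breaks = 40,
          main = "Simulated average annual returns",
          xlab = "Annual return factor")
abline(v = 1, lty = 2)   # factors below 1 mean losing money
```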

Here concludes the first part of this article. In the second part, we will increase the number of simulations from one thousand to one million and estimate the distribution of the returns obtained, in order to answer the questions posed at the beginning: What is the probability of ending up losing money in an SP500 index fund over a period of, say, 20 years? What is the probability of obtaining an average annual return of more than 5% per year? How many years should I keep my capital in the fund to be 99% sure of increasing it?…

Here is the link to the second part and the link to the GitHub repository where the code used in this article is located.

Student of the Master’s Degree in Data Science at the University of Cantabria | Mathematician. www.linkedin.com/in/marcoscobocarrillo
