# simple mathematical models and formula wizardry

The assignment and data files are attached to this assignment link.

## Analysis 1 â€“ Predicting sales (revenue) at Acme Software

This uses concepts and techniques we learned in the Golf Clubs Pricing Example we went through in class. In fact, feel free to use the file from that example as a starting template. **However, youâ€™ll be making significant changes to it as this problem differs from that one in several ways (such as using RMSE as the error metric instead of MAPE â€“ details below).**

**1 â€“ Time Series Plot**) Use the **SalesData.xlsx** data file provided with this assignment. On the first sheet, named Period1, it contains Acme total sales data from 1984-2012. Create a scatter plot which shows sales over time for the 1984-2012 period. **Suggestion (strong)**: Do NOT use the Year column directly as your X-axis for the scatter. Instead, create a YearIndex column which starts at 1 (for 1984) and use YearIndex in your scatter plot. Move your scatter plot to a chart sheet and make sure it is properly formatted and includes a title, axis labels, and whatever else you think it needs. Use principles of graphing excellence that we covered in the Data Visualization module. You can use this chart sheet as the basis for the trend fit charts in the next part (i.e. you can copy the sheet as needed).

**1a â€“ Trend Line Fitting**) Using Excelâ€™s Fit Trend Line feature of graphs, fit six different trend lines: linear, exponential, power, logarithmic, quadratic, and cubic. Which fits best? To determine which fits best, use the same approach as used in the Golf Clubs Pricing example we covered in class. However, do **NOT** use MAPE as your error metric. The error metric you are to use for choosing between the six candidate models is *root mean square error* (RMSE). RMSE is commonly used in data mining to evaluate the relative predictive accuracy of competing models. For a dataset of size n, the RMSE is defined as:

where and . So, basically, we are just squaring the forecast errors, adding them all up, dividing by the sample size to get an average or *mean squared error (MSE)*, and then taking the square root.

Describe how each of the models does in terms of fitting the data. Which models seem to fit the best? Which fit the worst?

**1b â€“ Forecasting**) Use all six models that you fit Part 1a to PREDICT sales values for 2013-2018. Youâ€™ll end up with six sets (of 6 years each ) of predictions (one set for each model). You are using the models you fit with the 1984-2012 data to make predictions (forecasts) for 2013-2018.

**1c â€“ Evaluating the Forecasts**) Now, letâ€™s add some more recent data to see how our models did. On the second sheet (Period2) of **SalesData.xlsx** you will find actual sales data from 2013-2018. Compute the RMSE for your predictions.

Which of the models performed best in terms of RMSE for forecasting 2013-2018 sales? Was it the same model that had the lowest RMSE in the model fitting portion (1a)?

Do the models perform better or worse during the prediction phase than they did during the model fitting phase? Why do you think this is the case?

Hereâ€™s what my main analysis sheet looks like. **IMPORTANT**: **My screenshot below is from a similar problem but with different data and a different number of years in Period1 and Period2**. Iâ€™m just showing you the overall layout of what my solution would look like. Obviously the graphs and fitted parameter values are on other sheets.

**1e â€“ Making a prediction for 2019**

Your job is to make a prediction for Acme for 2019 using ** one** of the six types of trend lines you used in this problem (linear, exponential, power, log, quadratic, and cubic). However, you do NOT need to use all of the data. In other words, you do not need to fit a trend line for the entire period 1984-2018. Use your analytical judgment to select the period of data you think is most relevant and use just that data to fit your trend line. Using your fitted trendline, compute your forecast for 2019. Also, include a discussion of why you chose the specific trend line function you did. Later, Iâ€™ll be posting a file containing everyoneâ€™s predictions and their model which provided the prediction.

**EXTRA CREDIT: Trend Fitting using Functions**

Redo the model fitting you did in part 1b for the exponential and cubic models, but do NOT use the Fit Trend Line feature of Excel graphs. Instead, use the approach shown in the **TrendFittingFunctions.xls** file found on our Moodle site. There is NO need to redo the RMSE analysis as the results of model fitting will be identical to what you already found. If you are really clever you can combine this with the use of the INDEX() function to pluck the parameters out of the results of the various array formulas. See the Excel Help for LINEST for a hint.

## Analysis 2 â€“ Excel formula work

**2a)** **Timestamp conversion**

Iâ€™ve given you a file named **timestamp_conversion_halfhour.xlsx**. In it you will find some data and a description of the formula that you need to create to answer the question. Obviously, youâ€™ll just be building your formulas directly in that spreadsheet and will be submitting the file as part of your homework submission.

**2b) Conditional summing and counting using bike share data**

Many cities have started bike sharing programs. These programs often capture very detailed data related to bike usage and such data can be quite helpful in managing these systems. In this problem, weâ€™ll be using data from San Franciscoâ€™s Cycle Share program. Iâ€™ve given you a file named **trip.xlsx**. Each row is a bike rental and includes information on when the bike was rented and returned, the station from which it was rented and to which it was returned and some other info. The fields are pretty self-explanatory. You can read more about about bike share data at https://www.kaggle.com/benhamner/sf-bay-area-bike-share. I modified the data somewhat from this source, so my data file doesnâ€™t match the one at Kaggle exactly but is similar. I often use this type of data in the exploratory data analysis assignment (HW3). Weâ€™ll use it in HW2 to get some practice with conditional counting and summing.

**Step 1** â€“ Explore the **trip.xlsx** file. The sheet named **trip** has the data and youâ€™ll see that it has just over 200,000 rows. Iâ€™ve already created range names using the column names. Go into the Name Manager or use the Name Box to see this. You should use these range names in your formulas.

**Step 2** â€“ Go to the sheet named **analysis**. On it youâ€™ll find a bunch of questions and guidance for building a series of formulas to do various summary computations such as counts and summations. Just build the formulas that are asked of you. Youâ€™ll see that in some I specify that you must use a specific Excel function (e.g. COUNTIF) while in others I specify you must use an array formula and CANNOT use things like COUNTIF or SUMIF. Obviously, you can always check your answers by using things like Pivot Tables.

Youâ€™ll be submitting the **trip.xlsx** file as part of your HW2 submission.