You are a data analyst for a basketball team with access to a large set of historical data that you can use to analyze performance patterns. The team’s coach and management have asked you to develop regression models that predict the number of wins in a regular season based on the performance metrics included in the data set. These models will inform key decisions to improve the team’s performance. You will use Python to perform the statistical analyses and then prepare a report of your findings to present to the team’s management. Since the managers are not data analysts, you will need to interpret your findings and describe their practical implications.
Note: This data set has been “cleaned” for the purposes of this assignment.
Reference
FiveThirtyEight. (April 26, 2019). FiveThirtyEight NBA Elo dataset. Kaggle. Retrieved from https://www.kaggle.com/fivethirtyeight/fivethirtyeight-nba-elo-dataset/
Directions
For this project, you will submit the Python script you used to make your calculations and a summary report explaining your findings.
- Python Script: To complete the tasks listed below, open the Project Three Jupyter Notebook link in the Assignment Information module. This notebook contains your data set and the Python scripts for your project. In the notebook, you will find step-by-step instructions and code blocks that will help you complete the following tasks:
  - Simple Linear Regression
    - Create scatterplots
    - Compute the correlation coefficient
    - Conduct a linear regression
  - Multiple Regression
    - Create scatterplots
    - Compute the correlation matrix
    - Conduct a multiple regression analysis
- Summary Report: Once you have completed all the steps in your Python script, you will create a summary report to present your findings. Use the provided template to create your report. You must complete each of the following sections:
  - Introduction: Set the context for your scenario and the analyses you will be performing.
  - Scatterplots and Correlation: Discuss relationships between variables using scatterplots and correlation coefficients.
  - Simple Linear Regression: Create a simple linear regression model to predict the response variable.
  - Multiple Regression: Create a multiple regression model to predict the response variable.
  - Conclusion: Summarize your findings and explain their practical implications.
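The regression tasks listed under Python Script can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the data are synthetic, the column names `avg_pts`, `avg_elo`, and `total_wins` are hypothetical stand-ins for whatever the data set actually provides, and the choice of `scipy.stats` and `numpy.linalg.lstsq` is an assumption (the course notebook may use a different library).

```python
import numpy as np
import pandas as pd
import scipy.stats as st

# Synthetic stand-in data; the column names are hypothetical, not the notebook's.
rng = np.random.default_rng(0)
n = 200
avg_pts = rng.normal(100, 5, n)
avg_elo = rng.normal(1500, 100, n)
total_wins = 0.5 * avg_pts + 0.05 * avg_elo - 85 + rng.normal(0, 3, n)
df = pd.DataFrame({"avg_pts": avg_pts, "avg_elo": avg_elo, "total_wins": total_wins})

# Correlation coefficient between one predictor and the response.
r, r_pvalue = st.pearsonr(df["avg_pts"], df["total_wins"])

# Correlation matrix across all variables.
corr_matrix = df.corr()

# Simple linear regression: total_wins = intercept + slope * avg_pts.
simple = st.linregress(df["avg_pts"], df["total_wins"])

# Multiple regression by ordinary least squares:
# total_wins = b0 + b1 * avg_pts + b2 * avg_elo.
X = np.column_stack([np.ones(n), df["avg_pts"], df["avg_elo"]])
coef, *_ = np.linalg.lstsq(X, df["total_wins"].to_numpy(), rcond=None)

print("r =", round(r, 3))
print(corr_matrix.round(3))
print("simple: slope =", round(simple.slope, 3), "intercept =", round(simple.intercept, 3))
print("multiple: intercept, b_pts, b_elo =", np.round(coef, 3))
```

A scatterplot of the same pair of variables could be drawn with `df.plot.scatter(x="avg_pts", y="total_wins")`, which requires matplotlib. In a simple linear regression, the `rvalue` returned by `linregress` equals the Pearson correlation coefficient, which is one way to confirm that the scatterplot, correlation, and regression steps agree with one another.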
What to Submit
To complete this project, you must submit the following:
Python Script: Your Jupyter Notebook Python script contains all the statistical analyses you completed for this project. Download your work as an HTML file. Review the file to make sure that every step and all your outputs are included. Submit the HTML file as part of your submission. Review the Jupyter Notebook in Codio Tutorial in the Supporting Materials section if you need help.
Summary Report: Use the provided template to create your summary report. The template contains guiding questions to help you complete each section. Be sure to remove these questions before submitting your report. Your summary report should be submitted as a 3- to 5-page Microsoft Word document. It should include an APA-style cover page and APA citations for any sources used. Use double spacing, 12-point Times New Roman font, and one-inch margins.
Python Code:
Step 1: Generating sample data
This block of Python code will generate a unique sample of size 50 that you will use in this discussion. Because the sample is drawn at random, your sample and therefore your answers will be unique. The numpy module in Python allows you to create a data set drawn from a Normal distribution. Note that the mean and standard deviation were chosen for you. The data set will be saved in a pandas dataframe that will be used in later calculations.
Click the block of code below and hit the Run button above.
In [1]:
import pandas as pd
import numpy as np
import math
import scipy.stats as st

# create 50 randomly chosen values from a Normal distribution. (arbitrarily using mean=2.48 and standard deviation=0.50).
diameters = np.random.normal(2.4800,0.500,50)

# convert the array into a dataframe with the column name "diameters" using pandas library.
diameters_df = pd.DataFrame(diameters, columns=['diameters'])
diameters_df = diameters_df.round(2)

# print the dataframe (note that the index of dataframe starts at 0).
print("Diameters data frame\n")
print(diameters_df)

Diameters data frame

    diameters
0        2.10
1        2.77
2        2.74
3        2.55
4        2.58
5        2.50
6        2.96
7        2.74
8        1.77
9        2.97
10       2.62
11       2.46
12       2.40
13       0.90
14       2.07
15       2.08
16       1.73
17       2.11
18       1.93
19       2.06
20       2.10
21       2.57
22       1.51
23       2.76
24       2.41
25       3.17
26       2.41
27       2.96
28       2.26
29       2.43
30       2.19
31       2.14
32       2.48
33       0.96
34       2.05
35       2.29
36       2.96
37       2.37
38       2.06
39       2.29
40       2.66
41       2.54
42       2.80
43       1.99
44       2.07
45       1.78
46       3.84
47       2.39
48       3.20
49       2.79

Step 2: Constructing confidence intervals
You will assume that the population standard deviation is known and that the sample size is sufficiently large. Then you will use the Normal distribution to construct these confidence intervals. You will use the submodule scipy.stats to construct confidence intervals using your sample data.
Click the block of code below and hit the Run button above.
In [3]:
# Python methods that calculate confidence intervals require the sample mean and the standard error as inputs.

# calculate the sample mean
mean = diameters_df['diameters'].mean()

# input the population standard deviation, which was given in Step 1.
std_deviation = 0.5000

# calculate standard error = standard deviation / sqrt(n) where n is the sample size.
stderr = std_deviation/math.sqrt(len(diameters_df['diameters']))

# construct a 90% confidence interval.
conf_int_90 = st.norm.interval(0.90, mean, stderr)
print("90% confidence interval (unrounded) =", conf_int_90)
print("90% confidence interval (rounded) = (", round(conf_int_90[0], 2), ",", round(conf_int_90[1], 2), ")")
print("")

# construct a 99% confidence interval.
conf_int_99 = st.norm.interval(0.99, mean, stderr)
print("99% confidence interval (unrounded) =", conf_int_99)
print("99% confidence interval (rounded) = (", round(conf_int_99[0], 2), ",", round(conf_int_99[1], 2), ")")

90% confidence interval (unrounded) = (2.2530912846323328, 2.4857087153676676)
90% confidence interval (rounded) = ( 2.25 , 2.49 )

99% confidence interval (unrounded) = (2.1872613632281555, 2.551538636771845)
99% confidence interval (rounded) = ( 2.19 , 2.55 )

Step 3: Performing hypothesis testing for the population mean
Since you were given the population standard deviation in Step 1 and the sample size is sufficiently large, you can use the z-test for population means. The ztest method in the statsmodels.stats.weightstats submodule runs the z-test. Its inputs are the sample data and the value under the null hypothesis. Its outputs are the test statistic and the two-tailed P-value.
Click the block of code below and hit the Run button above.
In [4]:
from statsmodels.stats.weightstats import ztest

# run z-test hypothesis test for population mean. The value under the null hypothesis is 2.30.
test_statistic, p_value = ztest(x1 = diameters_df['diameters'], value = 2.30)

print("z-test hypothesis test for population mean")
print("test-statistic =", round(test_statistic,2))
print("two tailed p-value =",round(p_value,4))

z-test hypothesis test for population mean
test-statistic = 0.94
two tailed p-value = 0.3481
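
As a cross-check on Steps 2 and 3, the same quantities can be computed directly from the formulas: the interval is the sample mean plus or minus a critical z-value times the standard error, and the test statistic is the distance between the sample mean and the null value measured in standard errors. The sketch below is illustrative only. The helper names `z_interval` and `z_test_known_sigma` are hypothetical, the sample is just the first ten values printed in Step 1, and the population standard deviation of 0.50 is taken from Step 1. Note that statsmodels' `ztest` estimates the standard deviation from the sample rather than using a known population value, so its statistic can differ slightly from one computed this way.

```python
import math
import scipy.stats as st


def z_interval(sample, sigma, confidence):
    """Confidence interval for the mean with a known population sigma (hypothetical helper)."""
    n = len(sample)
    mean = sum(sample) / n
    stderr = sigma / math.sqrt(n)
    # Critical value that leaves (1 - confidence) / 2 in each tail.
    z_crit = st.norm.ppf(1 - (1 - confidence) / 2)
    return (mean - z_crit * stderr, mean + z_crit * stderr)


def z_test_known_sigma(sample, sigma, null_value):
    """Two-tailed z-test for the mean with a known population sigma (hypothetical helper)."""
    n = len(sample)
    mean = sum(sample) / n
    z = (mean - null_value) / (sigma / math.sqrt(n))
    # Two-tailed p-value: probability of a statistic at least this extreme in either tail.
    p = 2 * st.norm.sf(abs(z))
    return z, p


# First ten values from the Step 1 printout, used here only for illustration.
sample = [2.10, 2.77, 2.74, 2.55, 2.58, 2.50, 2.96, 2.74, 1.77, 2.97]

print("90% interval:", z_interval(sample, 0.50, 0.90))
print("z, p-value:", z_test_known_sigma(sample, 0.50, 2.30))
```

The interval from `z_interval` should agree with `st.norm.interval` when given the same mean and standard error, which makes it a convenient sanity check on the notebook output.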