# Here, you are provided with code to "recreate" the dataset summarized in the two-way table above:
import pandas as pd

n_exposed = 6 + 16        # Number of those "exposed"
n_unexposed = 399 + 2064  # Number of those "unexposed"

df = pd.DataFrame({
    # replicate "Exposed" and "Unexposed" the appropriate number of times and store in a column
    'Exposure': ['Exposed'] * n_exposed + ['Unexposed'] * n_unexposed,
    # replicate "Yes" and "No" the appropriate number of times (and in the appropriate rows) and store in a column
    'ALL': ['Yes'] * 6 + ['No'] * 16 + ['Yes'] * 399 + ['No'] * 2064,
})

# 1. Write your null and alternative hypotheses.

# 2. Perform the hypothesis test using a randomization distribution.

# 3. Report the p-value.

# 4. What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.
# 1. Write your null and alternative hypotheses.
# H_0: p_{exposed} - p_{unexposed} = 0, or equivalently H_0: p_{exposed} = p_{unexposed}   # null has to have =
# H_a: p_{exposed} - p_{unexposed} > 0, or equivalently H_a: p_{exposed} > p_{unexposed}   # use > because sample difference (below) is larger than 0

# 2. Perform the hypothesis test using a randomization distribution.
import numpy as np

def diff_in_proportions(data):
    """Proportion of ALL == 'Yes' among the exposed minus that among the unexposed."""
    exposed = data['Exposure'] == 'Exposed'
    unexposed = data['Exposure'] == 'Unexposed'
    yes = data['ALL'] == 'Yes'
    return yes[exposed].mean() - yes[unexposed].mean()

pdiff = diff_in_proportions(df)  # sample difference of proportions from study data

samp = df.copy(deep=True)
N = 1000  # number of reallocations
real_df = pd.DataFrame({'sample': ['real' + str(j + 1) for j in range(N)],
                        'difference': np.nan})

### Reallocate!
for j in range(N):
    # shuffle the ALL labels across rows while keeping the Exposure column fixed
    samp['ALL'] = np.random.permutation(samp['ALL'].to_numpy())
    real_df.loc[j, 'difference'] = diff_in_proportions(samp)

# 3. Report the p-value.
# proportion of reallocated differences at least as large as the observed difference
print("p-value :", (real_df['difference'] >= pdiff).mean())

# 4. What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.
# Because the p-value is greater than the 0.05 significance level, we fail to reject the null hypothesis that the two proportions are equal.
# In other words, the data do NOT provide enough evidence to say that the proportion of ALL for those exposed to fracking is significantly
# greater than the proportion for those unexposed. Based on these data, we cannot conclude that fracking, by itself, plays a significant
# role in increasing ALL occurrence.
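
# Optional cross-check (not part of the original solution): a minimal sketch, assuming the
# counts from the two-way table (6 and 16 exposed, 399 and 2064 unexposed). Shuffling the ALL
# labels keeps the total number of 'Yes' fixed, so the number of 'Yes' among the exposed
# follows a hypergeometric distribution, and the difference in proportions increases with
# that count. The simulated p-value should therefore be close to the exact upper-tail
# probability computed here. The helper name `exact_upper_tail` is hypothetical.
from math import comb

def exact_upper_tail(yes_exposed, n_exposed, total_yes, total_n):
    """P(X >= yes_exposed) when n_exposed rows are drawn without replacement
    from total_n rows, of which total_yes are 'Yes'."""
    total_no = total_n - total_yes
    largest = min(n_exposed, total_yes)  # largest possible 'Yes' count among the exposed
    favorable = sum(comb(total_yes, k) * comb(total_no, n_exposed - k)
                    for k in range(yes_exposed, largest + 1))
    return favorable / comb(total_n, n_exposed)

exact_p = exact_upper_tail(yes_exposed=6, n_exposed=6 + 16,
                           total_yes=6 + 399, total_n=6 + 16 + 399 + 2064)
print("exact randomization p-value :", exact_p)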