Try It: Difference of Proportions


                    # Here, you are provided with code to "recreate" the dataset summarized in the two-way table above:
                    import pandas as pd
                    n_exposed = 6+16              # Number of those "exposed"
                    n_unexposed = 399+2064  # Number of those "unexposed"
                    df = pd.DataFrame({'Exposure' : ['Exposed']*n_exposed + ['Unexposed']*n_unexposed,   # replicate "Exposed" and "Unexposed" the appropriate number of times and store in a column
                                       'ALL' : ['Yes']*6 + ['No']*16 + ['Yes']*399 + ['No']*2064})   # replicate "Yes" and "No" the appropriate number of times (and in the appropriate rows) and store in a column
                    
                    # 1. Write your null and alternative hypotheses.
                    
                    # 2. Perform the hypothesis test using a randomization distribution.
                    
                    # 3. Report the p-value.
                    
                    # 4. What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.
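Before starting the test, it can help to confirm that the recreated `df` reproduces the counts in the two-way table. The following is a minimal sketch using `pd.crosstab`; the variable name `table` is introduced here only for illustration.

```python
import pandas as pd

# Recreate the dataset exactly as in the provided setup code
n_exposed = 6 + 16
n_unexposed = 399 + 2064
df = pd.DataFrame({'Exposure': ['Exposed'] * n_exposed + ['Unexposed'] * n_unexposed,
                   'ALL': ['Yes'] * 6 + ['No'] * 16 + ['Yes'] * 399 + ['No'] * 2064})

# Cross-tabulate exposure status against ALL status (hypothetical helper name: table)
table = pd.crosstab(df['Exposure'], df['ALL'])
print(table)
```

Each cell of the printed table should match the corresponding count used to build the two columns above.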


                    # Here, you are provided with code to "recreate" the dataset summarized in the two-way table above:
                    import pandas as pd
                    n_exposed = 6+16              # Number of those "exposed"
                    n_unexposed = 399+2064  # Number of those "unexposed"
                    df = pd.DataFrame({'Exposure' : ['Exposed']*n_exposed + ['Unexposed']*n_unexposed,   # replicate "Exposed" and "Unexposed" the appropriate number of times and store in a column
                                       'ALL' : ['Yes']*6 + ['No']*16 + ['Yes']*399 + ['No']*2064})   # replicate "Yes" and "No" the appropriate number of times (and in the appropriate rows) and store in a column
                    
                    # 1. Write your null and alternative hypotheses.
                    # H_0: p_{exposed} - p_{unexposed} = 0, or equivalently H_0: p_{exposed} = p_{unexposed}   # the null hypothesis must contain the equality
                    # H_a: p_{exposed} - p_{unexposed} > 0, or equivalently H_a: p_{exposed} > p_{unexposed}   # one-sided (>) because the sample difference (below) is larger than 0
                    
                    # 2. Perform the hypothesis test using a randomization distribution.
                    p_exp_yes = len(df[(df['Exposure'] == 'Exposed') & (df['ALL'] == 'Yes')])/len(df[df['Exposure'] == 'Exposed'])   # proportion of ALL for exposed
                    p_unexp_yes = len(df[(df['Exposure'] == 'Unexposed') & (df['ALL'] == 'Yes')])/len(df[df['Exposure'] == 'Unexposed'])   # proportion of ALL for unexposed
                    pdiff = p_exp_yes - p_unexp_yes   # sample difference of proportions from study data
                    
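                    import numpy as np
                    samp = df.copy(deep = True)   # work on a copy so the original data stay untouched
                    N = 1000                      # number of reallocations (randomization samples)
                    n = samp.shape[0]             # total number of observations
                    real_df = pd.DataFrame({'sample' : [''] * N,            # label of each reallocation (strings, so start from '' rather than NaN)
                                            'difference' : [np.nan] * N})   # difference of proportions in each reallocation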
                    
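                    ### Reallocate: shuffle the ALL labels across all rows N times, which breaks any link with Exposure (as the null hypothesis assumes)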
                    for j in range(N):
                      samp['ALL'] = np.random.choice(samp['ALL'], size = n, replace = False)
                      p_exp_yes = len(samp[(samp['Exposure'] == 'Exposed') & (samp['ALL'] == 'Yes')])/len(samp[samp['Exposure'] == 'Exposed'])
                      p_unexp_yes = len(samp[(samp['Exposure'] == 'Unexposed') & (samp['ALL'] == 'Yes')])/len(samp[samp['Exposure'] == 'Unexposed'])
                      real_df.loc[j, 'difference'] = p_exp_yes - p_unexp_yes
                      real_df.loc[j, 'sample'] = 'real' + str(j+1)
                    
                    # 3. Report the p-value.
                    print("p-value :", len(real_df[real_df['difference'] >= pdiff])/N)
                    
                    # 4. What is the conclusion of your test? State this in terms of the hypotheses, and also in terms of the original question above. Make sure to indicate the significance level you are using.
                    # Because the p-value is greater than the 0.05 significance level, we fail to reject the null hypothesis that the two proportions are equal. In other words, the data do NOT provide enough evidence to conclude that the proportion of ALL among those exposed to fracking is greater than the proportion among those unexposed. Based on this sample, we cannot conclude that fracking exposure increases ALL occurrence.
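The randomization p-value changes slightly from run to run. As a cross-check, the same one-sided p-value can be computed exactly: when the ALL labels are shuffled, the number of "Yes" labels landing in the exposed group follows a hypergeometric distribution, and a reallocated difference is at least as large as the observed one exactly when that count is at least 6. The sketch below uses only the standard library; the function name `hypergeom_upper_tail` and the other variable names are assumptions introduced for illustration and are not part of the original exercise.

```python
from math import comb

# Counts taken from the two-way table used in the exercise
exposed_yes, exposed_no = 6, 16
unexposed_yes, unexposed_no = 399, 2064

n_exposed = exposed_yes + exposed_no                  # size of the exposed group
total_yes = exposed_yes + unexposed_yes               # all ALL cases in the data
total = n_exposed + unexposed_yes + unexposed_no      # all observations


def hypergeom_upper_tail(k_obs, n_draw, n_success, n_total):
    """Return P(X >= k_obs) when n_draw labels are drawn without replacement
    from n_total labels, of which n_success are 'Yes'."""
    denom = comb(n_total, n_draw)
    upper = min(n_draw, n_success)
    numer = sum(comb(n_success, k) * comb(n_total - n_success, n_draw - k)
                for k in range(k_obs, upper + 1))
    return numer / denom


exact_p = hypergeom_upper_tail(exposed_yes, n_exposed, total_yes, total)
print("exact one-sided p-value:", exact_p)
```

With a large enough number of reallocations, the simulated p-value should land close to this exact value, so the two approaches should lead to the same conclusion at the 0.05 level.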