✏️ Exercises#

In this exercise we examine the data in the file MehrSongSpelke_exp_1.csv, which Mehr, S. A., Song, L. A., & Spelke, E. S. made available with their article For 5-month-old infants. These data are described in the teaching-lab example presented in these web pages. Here we focus on data manipulation and visualization.

Let's start by importing the required libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats as stats

%config InlineBackend.figure_format = 'retina'
RANDOM_SEED = 42
rng = np.random.default_rng(RANDOM_SEED)

plt.style.use("https://raw.githubusercontent.com/NeuromatchAcademy/course-content/main/nma.mplstyle")

Let's import the data from a CSV file.

tot_df = pd.read_csv("../chapter_0/laboratorio_didattico/MehrSongSpelke_exp_1.csv")

Let's check which class (type) the object tot_df belongs to.

type(tot_df)
pandas.core.frame.DataFrame

The output of the type function tells us that tot_df is a pandas DataFrame.

Next, we want to know how many columns and rows tot_df has. The column names are returned by the DataFrame's .columns attribute. Because there are many columns, we can print them all on screen by unpacking them with the following syntax.

print(*tot_df.columns)
id study_code exp1 exp2 exp3 exp4 exp5 dob dot1 dot2 dot3 female dad train Baseline_Proportion_Gaze_to_Singer Familiarization_Gaze_to_Familiar Familiarization_Gaze_to_Unfamiliar Test_Proportion_Gaze_to_Singer Difference_in_Proportion_Looking Estimated_Total_Number_of_Song totskypesing stim othersing comply_no module skype_before ammat ammar ammatot ammapr ipad_num famtot_6 unfamtot_6 totprac totw totnw age length delay mtotsing mbabylike msingcomf mtotrecord m_othersong pright diarymissing comply_fup survey_completion smsingrate smtalkrate gzsingrate gztalkrate famtot unfamtot totsing1 babylike1 singcomf1 totrecord1 othersong1 dtword1 dtnoword1 totsing2 babylike2 singcomf2 totrecord2 othersong2 dtword2 dtnoword2 totsing3 babylike3 singcomf3 totrecord3 othersong3 dtword3 dtnoword3 totsing4 babylike4 singcomf4 totrecord4 othersong4 dtword4 dtnoword4 totsing5 babylike5 singcomf5 totrecord5 othersong5 dtword5 dtnoword5 totsing6 babylike6 singcomf6 totrecord6 othersong6 dtword6 dtnoword6 totsing7 babylike7 singcomf7 totrecord7 othersong7 dtword7 dtnoword7 totsing8 babylike8 singcomf8 totrecord8 othersong8 dtword8 dtnoword8 totsing9 babylike9 singcomf9 totrecord9 othersong9 dtword9 dtnoword9 totsing10 babylike10 singcomf10 totrecord10 othersong10 dtword10 dtnoword10 totsing11 babylike11 singcomf11 totrecord11 othersong11 dtword11 dtnoword11 totsing12 babylike12 singcomf12 totrecord12 othersong12 dtword12 dtnoword12 totsing13 babylike13 singcomf13 totrecord13 othersong13 dtword13 dtnoword13 totsing14 babylike14 singcomf14 totrecord14 othersong14 dtword14 dtnoword14 filter_$

The .shape attribute tells us that tot_df has 96 rows and 153 columns.

tot_df.shape
(96, 153)

Working with all these columns is unwieldy, so we will create a DataFrame containing only a subset of the columns of the original DataFrame.

To extract a single column from the DataFrame we use the following syntax. For example, let's extract the column exp1.

tot_df["exp1"]
0     1
1     1
2     1
3     1
4     1
     ..
91    0
92    0
93    0
94    0
95    0
Name: exp1, Length: 96, dtype: int64

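Note that the following attribute syntax is equivalent (it works only when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute).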

tot_df.exp1
0     1
1     1
2     1
3     1
4     1
     ..
91    0
92    0
93    0
94    0
95    0
Name: exp1, Length: 96, dtype: int64

With the unique method we obtain the distinct values of exp1.

tot_df["exp1"].unique()
array([1, 0])

So exp1 codes whether each observation belongs to the first experiment: the value 1 marks rows that belong to it, and the value 0 marks rows that do not.
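To see how a 0/1 indicator column like this can be tallied, here is a minimal sketch. The DataFrame toy and its values are hypothetical and are not taken from the CSV file.

```python
import pandas as pd

# Hypothetical toy data: 1 marks rows that belong to the experiment, 0 rows that do not.
toy = pd.DataFrame({"exp1": [1, 1, 1, 0, 0]})

# value_counts() tallies how often each distinct value occurs.
counts = toy["exp1"].value_counts()
print(counts)

# Because the column is coded 0/1, its sum equals the number of rows in the experiment.
n_in_exp = int(toy["exp1"].sum())
print(n_in_exp)
```

Applied to tot_df["exp1"], the same calls would report how many of the 96 rows belong to experiment 1.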

We select only the columns listed below. Note the double square brackets: the inner brackets define a Python list of column names.

temp = tot_df[["exp1", "female", "stim", "age", "Test_Proportion_Gaze_to_Singer"]]
temp.head()
exp1 female stim age Test_Proportion_Gaze_to_Singer
0 1 0 "C1" 5.848049 0.602740
1 1 0 "C1" 5.979466 0.683027
2 1 0 "C2" 5.749486 0.724138
3 1 1 "C3" 5.913758 0.281654
4 1 1 "C4" 5.946612 0.498542
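The difference between single and double brackets can be checked on a small example. This is a sketch on a hypothetical DataFrame named toy, not on the data above.

```python
import pandas as pd

# Hypothetical toy data with three columns.
toy = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})

one_col = toy["a"]           # single brackets with a string: a Series
one_col_df = toy[["a"]]      # a list with one name: a one-column DataFrame
two_cols = toy[["a", "c"]]   # a list with two names: a two-column DataFrame

print(type(one_col).__name__, type(one_col_df).__name__, two_cols.shape)
```

Passing a list always returns a DataFrame, even when the list holds a single name.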

We select only the observations from experiment 1.

df = temp[temp["exp1"] == 1]
df.head()
exp1 female stim age Test_Proportion_Gaze_to_Singer
0 1 0 "C1" 5.848049 0.602740
1 1 0 "C1" 5.979466 0.683027
2 1 0 "C2" 5.749486 0.724138
3 1 1 "C3" 5.913758 0.281654
4 1 1 "C4" 5.946612 0.498542
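Row filtering of this kind relies on a boolean mask. The sketch below, on a hypothetical DataFrame toy, shows the mask explicitly, an equivalent query call, and a combination of two conditions.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"exp1": [1, 0, 1, 0], "age": [5.8, 5.2, 6.0, 5.5]})

# The comparison yields a boolean Series with one True/False per row.
mask = toy["exp1"] == 1

# Indexing with the mask keeps the rows where it is True; index labels are preserved.
subset = toy[mask]

# query() expresses the same filter as a string.
same = toy.query("exp1 == 1")

# Conditions are combined with & (and) or | (or), each wrapped in parentheses.
combined = toy[(toy["exp1"] == 1) & (toy["age"] > 5.9)]

print(subset)
print(combined)
```

Note that the filtered rows keep their original labels (0 and 2 here), which matters when selecting rows by label later on.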

We can drop the column exp1, which is now redundant.

df = df.drop(columns=["exp1"])
df.head()
female stim age Test_Proportion_Gaze_to_Singer
0 0 "C1" 5.848049 0.602740
1 0 "C1" 5.979466 0.683027
2 0 "C2" 5.749486 0.724138
3 1 "C3" 5.913758 0.281654
4 1 "C4" 5.946612 0.498542
df.tail()
female stim age Test_Proportion_Gaze_to_Singer
27 1 "C4" 5.355236 0.531100
28 0 "C8" 5.223819 0.541899
29 1 "C5" 6.045175 0.700389
30 1 "C4" 5.848049 0.762963
31 1 "C6" 5.420945 0.460274

The DataFrame df has 32 rows and 4 columns.

df.shape
(32, 4)

The dtypes attribute returns the type of each column.

df.dtypes
female                              int64
stim                               object
age                               float64
Test_Proportion_Gaze_to_Singer    float64
dtype: object
  • The variable female is numeric and takes integer values.

  • The variable stim is categorical, and its levels are represented as strings.

  • The variables age and Test_Proportion_Gaze_to_Singer are numeric and can take decimal values.
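A string column that represents a categorical variable can optionally be converted to the pandas category dtype. The sketch below uses a hypothetical column of stimulus codes; the original exercise does not perform this conversion.

```python
import pandas as pd

# Hypothetical toy data: stimulus codes stored as strings.
toy = pd.DataFrame({"stim": ["C1", "C2", "C1", "C3"]})

# Convert the string column to the category dtype.
stim_cat = toy["stim"].astype("category")

print(stim_cat.dtype)
print(list(stim_cat.cat.categories))  # the distinct levels, sorted
print(stim_cat.cat.codes.tolist())    # integer code of each row's level
```

The category dtype stores each level once and represents rows by integer codes, which makes the set of levels explicit.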

Selecting rows#

The rows of a DataFrame can be selected by row label or by row position.

The .loc[] indexer selects rows by index label.

Let's select the first row.

df.loc[0]
female                                   0
stim                                  "C1"
age                               5.848049
Test_Proportion_Gaze_to_Singer     0.60274
Name: 0, dtype: object

The last row.

df.loc[31]
female                                   1
stim                                  "C6"
age                               5.420945
Test_Proportion_Gaze_to_Singer    0.460274
Name: 31, dtype: object

The first 5 rows (with .loc[] a slice includes both endpoints).

df.loc[0:4]
female stim age Test_Proportion_Gaze_to_Singer
0 0 "C1" 5.848049 0.602740
1 0 "C1" 5.979466 0.683027
2 0 "C2" 5.749486 0.724138
3 1 "C3" 5.913758 0.281654
4 1 "C4" 5.946612 0.498542

The last 5 rows.

df.tail()
female stim age Test_Proportion_Gaze_to_Singer
27 1 "C4" 5.355236 0.531100
28 0 "C8" 5.223819 0.541899
29 1 "C5" 6.045175 0.700389
30 1 "C4" 5.848049 0.762963
31 1 "C6" 5.420945 0.460274

The rows with index labels 0, 3, 6, 9.

df.loc[[0, 3, 6, 9]]
female stim age Test_Proportion_Gaze_to_Singer
0 0 "C1" 5.848049 0.602740
3 1 "C3" 5.913758 0.281654
6 1 "C6" 5.486653 0.417755
9 1 "C1" 5.420945 0.586294

.iloc[] works like .loc[], but selects rows by integer position rather than by index label. In our current example the two return the same rows for a list of integers, because the index labels coincide with the row positions. (Slices differ: 0:4 includes the endpoint with .loc[] but excludes it with .iloc[].) Keep in mind, however, that index labels need not be the row numbers.

For example, let's repeat the last operation.

df.iloc[[0, 3, 6, 9]]
female stim age Test_Proportion_Gaze_to_Singer
0 0 "C1" 5.848049 0.602740
3 1 "C3" 5.913758 0.281654
6 1 "C6" 5.486653 0.417755
9 1 "C1" 5.420945 0.586294
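The distinction between labels and positions becomes visible as soon as the index is not 0, 1, 2, and so on. The following sketch uses a hypothetical DataFrame whose index labels start at 100.

```python
import pandas as pd

# Hypothetical toy data whose index labels differ from the row positions.
toy = pd.DataFrame({"x": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = toy.loc[101, "x"]     # row labelled 101
by_position = toy.iloc[1, 0]     # second row, first column: the same cell

label_slice = toy.loc[100:102]   # label slice: both endpoints included
pos_slice = toy.iloc[0:2]        # position slice: end excluded, as in Python lists

print(by_label, by_position, len(label_slice), len(pos_slice))
```

Here toy.loc[1] would raise a KeyError because no row carries the label 1, while toy.iloc[1] returns the second row.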

.iloc[] can select rows and columns at the same time. For example, let's select all rows of the columns female and age (positions 0 and 2), then print only the first 5 rows.

df.iloc[:, [0, 2]].head()
female age
0 0 5.848049
1 0 5.979466
2 0 5.749486
3 1 5.913758
4 1 5.946612

Descriptive statistics#

Let's compute the mean of the variable Test_Proportion_Gaze_to_Singer.

df["Test_Proportion_Gaze_to_Singer"].mean()
0.59349125

Let's compute the standard deviation as a descriptive statistic; ddof=0 divides the sum of squared deviations by n.

df["Test_Proportion_Gaze_to_Singer"].std(ddof=0)
0.17587420037119572
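The role of ddof can be checked by computing the standard deviation by hand. The sketch below uses a small hypothetical series, not the infant data: with ddof=0 the sum of squared deviations is divided by n, with ddof=1 (the pandas default) by n - 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with mean 5 and sum of squared deviations 32.
x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)

# Sum of squared deviations from the mean.
ss = ((x - x.mean()) ** 2).sum()

sd_descriptive = np.sqrt(ss / n)     # divides by n, matches ddof=0
sd_sample = np.sqrt(ss / (n - 1))    # divides by n - 1, matches ddof=1

print(sd_descriptive, x.std(ddof=0))
print(sd_sample, x.std())            # pandas defaults to ddof=1
print(np.std(x.to_numpy()))          # NumPy defaults to ddof=0
```

Because pandas and NumPy use different defaults, stating ddof explicitly avoids ambiguity about which version is reported.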

Let's compute the statistics listed below for the variables age and Test_Proportion_Gaze_to_Singer.

# String names select pandas' own implementations (std uses ddof=1)
# and avoid the FutureWarning raised when NumPy callables are passed.
summary_stats = ["min", "median", "mean", "std", "max"]
result = df[["age", "Test_Proportion_Gaze_to_Singer"]].aggregate(summary_stats)
print(result)
             age  Test_Proportion_Gaze_to_Singer
min     5.059548                        0.262846
median  5.700205                        0.556953
mean    5.612936                        0.593491
std     0.312476                        0.178688
max     6.110883                        0.950920

Let's repeat the previous calculations separately for males and females.

# String names select pandas' own implementations and avoid the FutureWarning.
summary_stats = ["min", "median", "mean", "std", "max"]
result = df.groupby("female")[["age", "Test_Proportion_Gaze_to_Singer"]].aggregate(summary_stats)
print(result)
             age                                         \
             min    median      mean      std       max   
female                                                    
0       5.059548  5.749486  5.563313  0.35660  6.110883   
1       5.256673  5.552361  5.656722  0.27123  6.110883   

       Test_Proportion_Gaze_to_Singer                                          
                                  min    median      mean       std       max  
female                                                                         
0                            0.262846  0.542105  0.607096  0.214455  0.950920  
1                            0.281654  0.571801  0.581487  0.145927  0.811189  

Equivalently, we can use the following syntax. Note the rounding.

summary_stats = (
    df.loc[:, ["female", "age", "Test_Proportion_Gaze_to_Singer"]]
    .groupby(["female"])
    .aggregate(["min", "median", "mean", "std", "max"])
)
summary_stats.round(2)
age Test_Proportion_Gaze_to_Singer
min median mean std max min median mean std max
female
0 5.06 5.75 5.56 0.36 6.11 0.26 0.54 0.61 0.21 0.95
1 5.26 5.55 5.66 0.27 6.11 0.28 0.57 0.58 0.15 0.81
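When only a few statistics are needed, pandas named aggregation gives flat, self-describing column names instead of a two-level header. This is a sketch on hypothetical toy data; the output names age_mean, age_max and n are illustrative choices.

```python
import pandas as pd

# Hypothetical toy data: two groups coded 0 and 1.
toy = pd.DataFrame({
    "female": [0, 0, 1, 1, 1],
    "age": [5.0, 6.0, 5.5, 5.5, 6.5],
})

# Named aggregation: new_column=(source_column, statistic).
summary = toy.groupby("female").agg(
    age_mean=("age", "mean"),
    age_max=("age", "max"),
    n=("age", "size"),
)
print(summary.round(2))
```

Each keyword becomes a column of the result, so the table can be read without consulting the code that produced it.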

Let's check the number of males and females.

df.groupby(["female"]).size()
female
0    15
1    17
dtype: int64

Let's now consider the variable age.

df["age"] 
0     5.848049
1     5.979466
2     5.749486
3     5.913758
4     5.946612
5     5.749486
6     5.486653
7     5.749486
8     6.110883
9     5.420945
10    6.110883
11    5.552361
12    5.749486
13    5.782341
14    5.815195
15    5.158111
16    5.256673
17    5.059548
18    5.190965
19    5.880904
20    5.256673
21    5.289528
22    5.552361
23    5.552361
24    5.650924
25    5.092402
26    5.815195
27    5.355236
28    5.223819
29    6.045175
30    5.848049
31    5.420945
Name: age, dtype: float64

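Let's create 5 classes by splitting the values of this variable at its quintiles (the first class contains the lowest 20% of the values, and so on). The column is called age_deciles in the code, although with 5 classes the cut points are quintiles rather than deciles.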

df["age_deciles"] = pd.qcut(df["age"], 5, labels=False)
df["age_deciles"]
0     3
1     4
2     2
3     4
4     4
5     2
6     1
7     2
8     4
9     1
10    4
11    1
12    2
13    3
14    3
15    0
16    0
17    0
18    0
19    4
20    0
21    1
22    1
23    1
24    2
25    0
26    3
27    1
28    0
29    4
30    3
31    1
Name: age_deciles, dtype: int64
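To make the behaviour of pd.qcut concrete, the sketch below compares it with pd.cut on a hypothetical series containing one extreme value. qcut places the cut points at quantiles, so the classes hold roughly equal numbers of observations; cut uses equal-width intervals instead.

```python
import pandas as pd

# Hypothetical toy data with one extreme value.
x = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quantile-based classes: 5 classes with about the same number of observations.
q_bins = pd.qcut(x, 5, labels=False)

# Equal-width classes: 5 intervals of the same length between min and max.
w_bins = pd.cut(x, 5, labels=False)

print(q_bins.tolist())
print(w_bins.tolist())
```

With quantile classes each class holds two values here, while with equal-width classes the extreme value sits alone in the last class and every other value falls in the first.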

Let's compute the mean of Test_Proportion_Gaze_to_Singer in each of the 5 age classes.

prop_by_age = df.groupby('age_deciles')['Test_Proportion_Gaze_to_Singer'].mean()
prop_by_age
age_deciles
0    0.453139
1    0.584660
2    0.758172
3    0.722895
4    0.533876
Name: Test_Proportion_Gaze_to_Singer, dtype: float64

Let's plot the means of Test_Proportion_Gaze_to_Singer as a function of the 5 age classes.

prop_by_age.plot();
../_images/7a88ba1cc74cac479ff0cde6989afbe47a78d49c93fa72a0b14b43721e7ab2b4.png

No noteworthy trend emerges.

Let's create a histogram of the variable Test_Proportion_Gaze_to_Singer.

plt.hist(df["Test_Proportion_Gaze_to_Singer"], density=True);
../_images/3eaf7a0c7322b1f441af6322f597089084eb5002b8fb312239ec0c2bbc9e49b2.png
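With density=True the bar heights are scaled so that the total area of the bars equals 1, which is why the vertical axis can exceed 1. The sketch below checks this with np.histogram, which applies the same kind of normalization, on simulated data (hypothetical values, not the infant data).

```python
import numpy as np

# Simulated data, for illustration only.
rng = np.random.default_rng(42)
x = rng.normal(size=200)

# density=True rescales counts to heights such that sum(height * bin width) = 1.
heights, edges = np.histogram(x, bins=10, density=True)
widths = np.diff(edges)
area = np.sum(heights * widths)

print(area)
```

The heights are therefore densities, not proportions: a bar's proportion of observations is its height multiplied by its width.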

Let's represent the same data with a KDE plot.

density = stats.gaussian_kde(df["Test_Proportion_Gaze_to_Singer"])
xs = np.linspace(min(df["Test_Proportion_Gaze_to_Singer"]), max(df["Test_Proportion_Gaze_to_Singer"]), 1000)
plt.plot(xs, density(xs));
../_images/10793b686e4aac1263aeee643d966e424457f868eab9f34c54b2574c6f9af494.png
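A Gaussian KDE is an average of normal densities, one centred on each observation. The sketch below rebuilds the estimate by hand for a few hypothetical values and compares it with stats.gaussian_kde; it assumes the default bandwidth, which for one-dimensional data is the kde.factor attribute multiplied by the sample standard deviation (ddof=1).

```python
import numpy as np
from scipy import stats

# Hypothetical toy data, for illustration only.
x = np.array([0.26, 0.45, 0.50, 0.58, 0.60, 0.72, 0.95])

kde = stats.gaussian_kde(x)

# Kernel standard deviation implied by the default bandwidth (assumption: 1-D data).
h = kde.factor * x.std(ddof=1)

# Manual estimate: mean of normal densities centred on each observation.
grid = np.linspace(0.0, 1.2, 50)
z = (grid[:, None] - x[None, :]) / h
manual = stats.norm.pdf(z).mean(axis=1) / h

print(np.allclose(manual, kde(grid)))
```

A larger bandwidth gives a smoother curve, a smaller one follows the individual observations more closely.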

With seaborn it is easy to create two overlaid KDE plots, one for males and one for females.

# KDE plot for female = 0 (or the first level)
sns.kdeplot(df[df["female"] == 0]["Test_Proportion_Gaze_to_Singer"], label='Female = 0')
# KDE plot for female = 1 (or the second level)
sns.kdeplot(df[df["female"] == 1]["Test_Proportion_Gaze_to_Singer"], label='Female = 1')
plt.legend();
../_images/ed2b48ff8754888421940160276600f87fe6b1f4a196ccd080e2ade1850b69b6.png

Let's create two boxplots, one for males and one for females.

# Filter the data by the levels of the "female" variable
female_0 = df[df["female"] == 0]["Test_Proportion_Gaze_to_Singer"]
female_1 = df[df["female"] == 1]["Test_Proportion_Gaze_to_Singer"]

# Combine the filtered data
data = [female_0, female_1]

# Plot the boxplots
plt.boxplot(data, labels=['Males', 'Females'])

plt.title('Boxplot of Test Proportion Gaze to Singer by Gender')
plt.ylabel('Test Proportion Gaze to Singer');
../_images/9a4c6f39cdfba951a4192b37a27dce2c6937a2ece0baa8dd96090e453e37d280.png
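The elements drawn in a boxplot can be computed directly. The sketch below does so for a hypothetical sample, assuming the matplotlib default whis=1.5: the box spans the first to the third quartile, the whiskers reach the most extreme observations within 1.5 times the interquartile range of the box, and points beyond them are drawn individually.

```python
import numpy as np

# Hypothetical toy data with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# Box: first quartile, median, third quartile.
q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the box (assumed default whis=1.5).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences.
inside = x[(x >= lower_fence) & (x <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Observations outside the fences are plotted as individual points.
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, med, q3, whisker_low, whisker_high, outliers)
```

Applying the same steps to female_0 and female_1 would give the numbers behind the two boxes shown above.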