{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"
\n",
"\n",
"\n",
"(pandas-functions-notebook)=\n",
"# Pandas (3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" In questo tutorial, verranno presentate le funzioni di Pandas più utili per condurre le usuali operazioni di manipolazione dei dati."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from scipy import stats\n",
"import arviz as az\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%config InlineBackend.figure_format = 'retina'\n",
"RANDOM_SEED = 42\n",
"rng = np.random.default_rng(RANDOM_SEED)\n",
"az.style.use(\"arviz-darkgrid\")\n",
"sns.set_theme(palette=\"colorblind\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Per questo tutorial, useremo nuovamente i dati `penguins.csv`. Come in precedenza, dopo avere caricato i dati, rimuoviamo i dati mancanti.\n",
"\n",
"https://regenerativetoday.com/30-very-useful-pandas-functions-for-everyday-data-analysis-tasks/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `pd.read_csv`, `pd.read_excel`\n",
"\n",
"La prima funzione da menzionare è `read_csv` o `read_excel`. Le funzioni vengono utilizzate per leggere un file CSV o un file Excel in formato DataFrame di Pandas. Qui stiamo utilizzando la funzione `read_csv` per leggere il dataset `penguins`. In precedenza abbiamo anche visto come la funzione `dropna` viene utilizzata per rimuovere tutte le righe del DataFrame che includono dati mancanti."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" species | \n",
" island | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" flipper_length_mm | \n",
" body_mass_g | \n",
" sex | \n",
" year | \n",
"
\n",
" \n",
" \n",
" \n",
" 339 | \n",
" Chinstrap | \n",
" Dream | \n",
" 55.8 | \n",
" 19.8 | \n",
" 207.0 | \n",
" 4000.0 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 340 | \n",
" Chinstrap | \n",
" Dream | \n",
" 43.5 | \n",
" 18.1 | \n",
" 202.0 | \n",
" 3400.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
" 341 | \n",
" Chinstrap | \n",
" Dream | \n",
" 49.6 | \n",
" 18.2 | \n",
" 193.0 | \n",
" 3775.0 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 342 | \n",
" Chinstrap | \n",
" Dream | \n",
" 50.8 | \n",
" 19.0 | \n",
" 210.0 | \n",
" 4100.0 | \n",
" male | \n",
" 2009 | \n",
"
\n",
" \n",
" 343 | \n",
" Chinstrap | \n",
" Dream | \n",
" 50.2 | \n",
" 18.7 | \n",
" 198.0 | \n",
" 3775.0 | \n",
" female | \n",
" 2009 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
"339 Chinstrap Dream 55.8 19.8 207.0 \n",
"340 Chinstrap Dream 43.5 18.1 202.0 \n",
"341 Chinstrap Dream 49.6 18.2 193.0 \n",
"342 Chinstrap Dream 50.8 19.0 210.0 \n",
"343 Chinstrap Dream 50.2 18.7 198.0 \n",
"\n",
" body_mass_g sex year \n",
"339 4000.0 male 2009 \n",
"340 3400.0 female 2009 \n",
"341 3775.0 male 2009 \n",
"342 4100.0 male 2009 \n",
"343 3775.0 female 2009 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.read_csv(\"../data/penguins.csv\")\n",
"df.dropna(inplace=True)\n",
"df.tail()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `.columns`\n",
"\n",
"Quando si dispone di un grande dataset, può essere difficile visualizzare tutte le colonne. Utilizzando la funzione `columns`, è possibile stampare tutte le colonne del dataset."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',\n",
" 'flipper_length_mm', 'body_mass_g', 'sex', 'year'],\n",
" dtype='object')"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `.drop()`\n",
"\n",
"È possibile eliminare alcune colonne non necessarie utilizzando `drop`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',\n",
" 'flipper_length_mm', 'body_mass_g', 'sex'],\n",
" dtype='object')"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df.drop(columns=[\"year\"])\n",
"df.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `len()`\n",
"\n",
"Fornisce il numero di righe di un DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"333"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `.query()`\n",
"\n",
"È possibile filtrare un DataFrame utilizzando un'espressione booleana."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"68"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df1 = df.query(\"species == 'Chinstrap' & island == 'Dream'\")\n",
"len(df1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `.iloc[]`\n",
"\n",
"Questa funzione accetta come parametri gli indici delle righe e delle colonne, fornendo una selezione del DataFrame in base a questi. In questo caso, stiamo selezionando le prime 3 righe di dati e le colonne con indice 2, 3 e 5."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" bill_length_mm | \n",
" bill_depth_mm | \n",
" body_mass_g | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 39.1 | \n",
" 18.7 | \n",
" 3750.0 | \n",
"
\n",
" \n",
" 1 | \n",
" 39.5 | \n",
" 17.4 | \n",
" 3800.0 | \n",
"
\n",
" \n",
" 2 | \n",
" 40.3 | \n",
" 18.0 | \n",
" 3250.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" bill_length_mm bill_depth_mm body_mass_g\n",
"0 39.1 18.7 3750.0\n",
"1 39.5 17.4 3800.0\n",
"2 40.3 18.0 3250.0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2 = df.iloc[:3, [2, 3, 5]]\n",
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## `.loc[]`\n",
"\n",
"Questa funzione compie un'operazione molto simile a quella della funzione `.iloc`. Tuttavia, in questo caso, abbiamo la possibilità di specificare gli indici delle righe che desideriamo, insieme ai nomi delle colonne che vogliamo includere nella nostra selezione."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" island | \n",
" flipper_length_mm | \n",
" sex | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" Torgersen | \n",
" 195.0 | \n",
" female | \n",
"
\n",
" \n",
" 4 | \n",
" Torgersen | \n",
" 193.0 | \n",
" female | \n",
"
\n",
" \n",
" 6 | \n",
" Torgersen | \n",
" 181.0 | \n",
" female | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" island flipper_length_mm sex\n",
"2 Torgersen 195.0 female\n",
"4 Torgersen 193.0 female\n",
"6 Torgersen 181.0 female"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3 = df.loc[[2, 4, 6], [\"island\", \"flipper_length_mm\", \"sex\"]]\n",
"df3"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last updated: Mon Jan 29 2024\n",
"\n",
"Python implementation: CPython\n",
"Python version : 3.11.7\n",
"IPython version : 8.19.0\n",
"\n",
"Compiler : Clang 16.0.6 \n",
"OS : Darwin\n",
"Release : 23.3.0\n",
"Machine : x86_64\n",
"Processor : i386\n",
"CPU cores : 8\n",
"Architecture: 64bit\n",
"\n",
"seaborn : 0.13.0\n",
"numpy : 1.26.2\n",
"arviz : 0.17.0\n",
"pandas : 2.1.4\n",
"matplotlib: 3.8.2\n",
"scipy : 1.11.4\n",
"\n",
"Watermark: 2.4.3\n",
"\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -n -u -v -iv -w -m"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "pymc",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "cbb367cc0128e23b7454d788d5a4229ca1f9848fd2e857f4797fbd26ab3b0776"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}