{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<a target=\"_blank\" href=\"https://colab.research.google.com/github/ccaudek/ds4psy/blob/main/06_pandas_functions.ipynb\">\n",
    "  <img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/>\n",
    "</a>\n",
    "\n",
    "(pandas-functions-notebook)=\n",
    "# Pandas (3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " In questo tutorial, verranno presentate le funzioni di Pandas più utili per condurre le usuali operazioni di manipolazione dei dati."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy import stats\n",
    "import arviz as az\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "%config InlineBackend.figure_format = 'retina'\n",
    "RANDOM_SEED = 42\n",
    "rng = np.random.default_rng(RANDOM_SEED)\n",
    "az.style.use(\"arviz-darkgrid\")\n",
    "sns.set_theme(palette=\"colorblind\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Per questo tutorial, useremo nuovamente i dati `penguins.csv`. Come in precedenza, dopo avere caricato i dati, rimuoviamo i dati mancanti.\n",
    "\n",
    "https://regenerativetoday.com/30-very-useful-pandas-functions-for-everyday-data-analysis-tasks/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `pd.read_csv`, `pd.read_excel`\n",
    "\n",
    "La prima funzione da menzionare è `read_csv` o `read_excel`. Le funzioni vengono utilizzate per leggere un file CSV o un file Excel in formato DataFrame di Pandas. Qui stiamo utilizzando la funzione `read_csv` per leggere il dataset `penguins`. In precedenza abbiamo anche visto come la funzione `dropna` viene utilizzata per rimuovere tutte le righe del DataFrame che includono dati mancanti."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>species</th>\n",
       "      <th>island</th>\n",
       "      <th>bill_length_mm</th>\n",
       "      <th>bill_depth_mm</th>\n",
       "      <th>flipper_length_mm</th>\n",
       "      <th>body_mass_g</th>\n",
       "      <th>sex</th>\n",
       "      <th>year</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>339</th>\n",
       "      <td>Chinstrap</td>\n",
       "      <td>Dream</td>\n",
       "      <td>55.8</td>\n",
       "      <td>19.8</td>\n",
       "      <td>207.0</td>\n",
       "      <td>4000.0</td>\n",
       "      <td>male</td>\n",
       "      <td>2009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>340</th>\n",
       "      <td>Chinstrap</td>\n",
       "      <td>Dream</td>\n",
       "      <td>43.5</td>\n",
       "      <td>18.1</td>\n",
       "      <td>202.0</td>\n",
       "      <td>3400.0</td>\n",
       "      <td>female</td>\n",
       "      <td>2009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>341</th>\n",
       "      <td>Chinstrap</td>\n",
       "      <td>Dream</td>\n",
       "      <td>49.6</td>\n",
       "      <td>18.2</td>\n",
       "      <td>193.0</td>\n",
       "      <td>3775.0</td>\n",
       "      <td>male</td>\n",
       "      <td>2009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>342</th>\n",
       "      <td>Chinstrap</td>\n",
       "      <td>Dream</td>\n",
       "      <td>50.8</td>\n",
       "      <td>19.0</td>\n",
       "      <td>210.0</td>\n",
       "      <td>4100.0</td>\n",
       "      <td>male</td>\n",
       "      <td>2009</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>343</th>\n",
       "      <td>Chinstrap</td>\n",
       "      <td>Dream</td>\n",
       "      <td>50.2</td>\n",
       "      <td>18.7</td>\n",
       "      <td>198.0</td>\n",
       "      <td>3775.0</td>\n",
       "      <td>female</td>\n",
       "      <td>2009</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       species island  bill_length_mm  bill_depth_mm  flipper_length_mm  \\\n",
       "339  Chinstrap  Dream            55.8           19.8              207.0   \n",
       "340  Chinstrap  Dream            43.5           18.1              202.0   \n",
       "341  Chinstrap  Dream            49.6           18.2              193.0   \n",
       "342  Chinstrap  Dream            50.8           19.0              210.0   \n",
       "343  Chinstrap  Dream            50.2           18.7              198.0   \n",
       "\n",
       "     body_mass_g     sex  year  \n",
       "339       4000.0    male  2009  \n",
       "340       3400.0  female  2009  \n",
       "341       3775.0    male  2009  \n",
       "342       4100.0    male  2009  \n",
       "343       3775.0  female  2009  "
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.read_csv(\"../data/penguins.csv\")\n",
    "df.dropna(inplace=True)\n",
    "df.tail()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `.columns`\n",
    "\n",
    "Quando si dispone di un grande dataset, può essere difficile visualizzare tutte le colonne. Utilizzando la funzione `columns`, è possibile stampare tutte le colonne del dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',\n",
       "       'flipper_length_mm', 'body_mass_g', 'sex', 'year'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `.drop()`\n",
    "\n",
    "È possibile eliminare alcune colonne non necessarie utilizzando `drop`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',\n",
       "       'flipper_length_mm', 'body_mass_g', 'sex'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df.drop(columns=[\"year\"])\n",
    "df.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `len()`\n",
    "\n",
    "Fornisce il numero di righe di un DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "333"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `.query()`\n",
    "\n",
    "È possibile filtrare un DataFrame utilizzando un'espressione booleana."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "68"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df1 = df.query(\"species == 'Chinstrap' & island == 'Dream'\")\n",
    "len(df1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `.iloc[]`\n",
    "\n",
    "Questa funzione accetta come parametri gli indici delle righe e delle colonne, fornendo una selezione del DataFrame in base a questi. In questo caso, stiamo selezionando le prime 3 righe di dati e le colonne con indice 2, 3 e 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>bill_length_mm</th>\n",
       "      <th>bill_depth_mm</th>\n",
       "      <th>body_mass_g</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>39.1</td>\n",
       "      <td>18.7</td>\n",
       "      <td>3750.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>39.5</td>\n",
       "      <td>17.4</td>\n",
       "      <td>3800.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>40.3</td>\n",
       "      <td>18.0</td>\n",
       "      <td>3250.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   bill_length_mm  bill_depth_mm  body_mass_g\n",
       "0            39.1           18.7       3750.0\n",
       "1            39.5           17.4       3800.0\n",
       "2            40.3           18.0       3250.0"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2 = df.iloc[:3, [2, 3, 5]]\n",
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## `.loc[]`\n",
    "\n",
    "Questa funzione compie un'operazione molto simile a quella della funzione `.iloc`. Tuttavia, in questo caso, abbiamo la possibilità di specificare gli indici delle righe che desideriamo, insieme ai nomi delle colonne che vogliamo includere nella nostra selezione."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>island</th>\n",
       "      <th>flipper_length_mm</th>\n",
       "      <th>sex</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Torgersen</td>\n",
       "      <td>195.0</td>\n",
       "      <td>female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Torgersen</td>\n",
       "      <td>193.0</td>\n",
       "      <td>female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Torgersen</td>\n",
       "      <td>181.0</td>\n",
       "      <td>female</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      island  flipper_length_mm     sex\n",
       "2  Torgersen              195.0  female\n",
       "4  Torgersen              193.0  female\n",
       "6  Torgersen              181.0  female"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df3 = df.loc[[2, 4, 6], [\"island\", \"flipper_length_mm\", \"sex\"]]\n",
    "df3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Last updated: Mon Jan 29 2024\n",
      "\n",
      "Python implementation: CPython\n",
      "Python version       : 3.11.7\n",
      "IPython version      : 8.19.0\n",
      "\n",
      "Compiler    : Clang 16.0.6 \n",
      "OS          : Darwin\n",
      "Release     : 23.3.0\n",
      "Machine     : x86_64\n",
      "Processor   : i386\n",
      "CPU cores   : 8\n",
      "Architecture: 64bit\n",
      "\n",
      "seaborn   : 0.13.0\n",
      "numpy     : 1.26.2\n",
      "arviz     : 0.17.0\n",
      "pandas    : 2.1.4\n",
      "matplotlib: 3.8.2\n",
      "scipy     : 1.11.4\n",
      "\n",
      "Watermark: 2.4.3\n",
      "\n"
     ]
    }
   ],
   "source": [
    "%load_ext watermark\n",
    "%watermark -n -u -v -iv -w -m"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pymc",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "cbb367cc0128e23b7454d788d5a4229ca1f9848fd2e857f4797fbd26ab3b0776"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}