oC-n · June 30, 2021 15:15
diff --git a/PY0101EN-5.4_WorkingWithDifferentFileTypes.ipynb b/PY0101EN-5.4_WorkingWithDifferentFileTypes.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\"  />\n",
    "</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Table of Contents\n",
    "\n",
    "<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
    "\n",
    "<font size = 3>\n",
    "\n",
    "1.  <a href=\"#item31\">Data Engineering</a>\n",
    "2.  <a href=\"#item31\">Data Engineering Process</a>\n",
    "3.  <a href=\"#item32\">Working with different file formats</a>\n",
    "4.  <a href=\"#item33\">Data Analysis</a>\n",
    "\n",
    "</font>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Engineering\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Data engineering** is one of the most critical and foundational skills in any data scientist’s toolkit.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Engineering Process\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several steps in Data Engineering process.\n",
    "\n",
    "1.  **Extract** - Data extraction is getting data from multiple sources. Ex. Data extraction from a website using Web scraping or gathering information from the data that are stored in different formats(JSON, CSV, XLSX etc.).\n",
    "\n",
    "2.  **Transform** - Tarnsforming the data means removing the data that we don't need for further analysis and converting the data in the format that all the data from the multiple sources is in the same format.\n",
    "\n",
    "3.  **Load** - Loading the data inside a data warehouse. Data warehouse essentially contains large volumes of data that are accessed to gather insights.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working with different file formats\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the real-world, people rarely get neat tabular data. Thus, it is mandatory for any data scientist (or data engineer) to be aware of different file formats, common challenges in handling them and the best, most efficient ways to handle this data in real life. We have reviewed some of this content in other modules.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### File Format\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A file format is a standard way in which information is encoded for storage in a file. First, the file format specifies whether the file is a binary or ASCII file. Second, it shows how the information is organized. For example, the comma-separated values (CSV) file format stores tabular data in plain text.\n",
    "\n",
    "To identify a file format, you can usually look at the file extension to get an idea. For example, a file saved with name “Data” in “CSV” format will appear as “Data.csv”. By noticing the “.csv” extension, we can clearly identify that it is a “CSV” file and the data is stored in a tabular format.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are various formats for a dataset, .csv, .json, .xlsx etc. The dataset can be stored in different places, on your local machine or sometimes online.\n",
    "\n",
    "**In this section, you will learn how to load a dataset into our Jupyter Notebook.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we will look at some file formats and how to read them in Python:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Comma-separated values (CSV) file format\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **Comma-separated values** file format falls under a spreadsheet file format.\n",
    "\n",
    "In a spreadsheet file format, data is stored in cells. Each cell is organized in rows and columns. A column in the spreadsheet file can have different types. For example, a column can be of string type, a date type, or an integer type.\n",
    "\n",
    "Each line in CSV file represents an observation, or commonly called a record. Each record may contain one or more fields which are separated by a comma.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading data from CSV in Python\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **Pandas** Library is a useful tool that enables us to read various datasets into a Pandas data frame\n",
    "\n",
    "Let us look at how to read a CSV file in Pandas Library.\n",
    "\n",
    "We use **pandas.read_csv()** function to read the csv file. In the parentheses, we put the file path along with a quotation mark as an argument, so that pandas will read the file into a data frame from that address. The file path can be either a URL or your local file address.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "url ='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/addresses.csv'\n",
    "df = pd.read_csv(url)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>John</th>\n",
       "      <th>Doe</th>\n",
       "      <th>120 jefferson st.</th>\n",
       "      <th>Riverside</th>\n",
       "      <th>NJ</th>\n",
       "      <th>08075</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jack</td>\n",
       "      <td>McGinnis</td>\n",
       "      <td>220 hobo Av.</td>\n",
       "      <td>Phila</td>\n",
       "      <td>PA</td>\n",
       "      <td>9119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>John \"Da Man\"</td>\n",
       "      <td>Repici</td>\n",
       "      <td>120 Jefferson St.</td>\n",
       "      <td>Riverside</td>\n",
       "      <td>NJ</td>\n",
       "      <td>8075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Tyler</td>\n",
       "      <td>7452 Terrace \"At the Plaza\" road</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>91234</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Blankman</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>298</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Joan \"the bone\", Anne</td>\n",
       "      <td>Jet</td>\n",
       "      <td>9th, at Terrace plc</td>\n",
       "      <td>Desert City</td>\n",
       "      <td>CO</td>\n",
       "      <td>123</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    John       Doe                 120 jefferson st.  \\\n",
       "0                   Jack  McGinnis                      220 hobo Av.   \n",
       "1          John \"Da Man\"    Repici                 120 Jefferson St.   \n",
       "2                Stephen     Tyler  7452 Terrace \"At the Plaza\" road   \n",
       "3                    NaN  Blankman                               NaN   \n",
       "4  Joan \"the bone\", Anne       Jet               9th, at Terrace plc   \n",
       "\n",
       "     Riverside   NJ   08075  \n",
       "0        Phila   PA    9119  \n",
       "1    Riverside   NJ    8075  \n",
       "2     SomeTown   SD   91234  \n",
       "3     SomeTown   SD     298  \n",
       "4  Desert City   CO     123  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Adding column name to the DataFrame\n",
    "\n",
    "We can add columns to an existing DataFrame using its **columns** attribute.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.columns =['First Name', 'Last Name', 'Location ', 'City','State','Area Code']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>City</th>\n",
       "      <th>State</th>\n",
       "      <th>Area Code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jack</td>\n",
       "      <td>McGinnis</td>\n",
       "      <td>220 hobo Av.</td>\n",
       "      <td>Phila</td>\n",
       "      <td>PA</td>\n",
       "      <td>9119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>John \"Da Man\"</td>\n",
       "      <td>Repici</td>\n",
       "      <td>120 Jefferson St.</td>\n",
       "      <td>Riverside</td>\n",
       "      <td>NJ</td>\n",
       "      <td>8075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Tyler</td>\n",
       "      <td>7452 Terrace \"At the Plaza\" road</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>91234</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Blankman</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>298</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Joan \"the bone\", Anne</td>\n",
       "      <td>Jet</td>\n",
       "      <td>9th, at Terrace plc</td>\n",
       "      <td>Desert City</td>\n",
       "      <td>CO</td>\n",
       "      <td>123</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              First Name Last Name                         Location   \\\n",
       "0                   Jack  McGinnis                      220 hobo Av.   \n",
       "1          John \"Da Man\"    Repici                 120 Jefferson St.   \n",
       "2                Stephen     Tyler  7452 Terrace \"At the Plaza\" road   \n",
       "3                    NaN  Blankman                               NaN   \n",
       "4  Joan \"the bone\", Anne       Jet               9th, at Terrace plc   \n",
       "\n",
       "          City State  Area Code  \n",
       "0        Phila    PA       9119  \n",
       "1    Riverside    NJ       8075  \n",
       "2     SomeTown    SD      91234  \n",
       "3     SomeTown    SD        298  \n",
       "4  Desert City    CO        123  "
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>City</th>\n",
       "      <th>State</th>\n",
       "      <th>Area Code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "Empty DataFrame\n",
       "Columns: [First Name, Last Name, Location , City, State, Area Code]\n",
       "Index: []"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# We see here that the assumed header has been replaced with the one specified, but this means we lose the first line of data.\n",
    "# Let's see whether pre-defining the dataframe columns would have avoided this.\n",
    "df1 = pd.DataFrame(columns = ['First Name', 'Last Name', 'Location ', 'City','State','Area Code'])\n",
    "df1.append(pd.read_csv(url))\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>City</th>\n",
       "      <th>State</th>\n",
       "      <th>Area Code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>John</td>\n",
       "      <td>Doe</td>\n",
       "      <td>120 jefferson st.</td>\n",
       "      <td>Riverside</td>\n",
       "      <td>NJ</td>\n",
       "      <td>8075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Jack</td>\n",
       "      <td>McGinnis</td>\n",
       "      <td>220 hobo Av.</td>\n",
       "      <td>Phila</td>\n",
       "      <td>PA</td>\n",
       "      <td>9119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>John \"Da Man\"</td>\n",
       "      <td>Repici</td>\n",
       "      <td>120 Jefferson St.</td>\n",
       "      <td>Riverside</td>\n",
       "      <td>NJ</td>\n",
       "      <td>8075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Tyler</td>\n",
       "      <td>7452 Terrace \"At the Plaza\" road</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>91234</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Blankman</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>298</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Joan \"the bone\", Anne</td>\n",
       "      <td>Jet</td>\n",
       "      <td>9th, at Terrace plc</td>\n",
       "      <td>Desert City</td>\n",
       "      <td>CO</td>\n",
       "      <td>123</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              First Name Last Name                         Location   \\\n",
       "0                   John       Doe                 120 jefferson st.   \n",
       "1                   Jack  McGinnis                      220 hobo Av.   \n",
       "2          John \"Da Man\"    Repici                 120 Jefferson St.   \n",
       "3                Stephen     Tyler  7452 Terrace \"At the Plaza\" road   \n",
       "4                    NaN  Blankman                               NaN   \n",
       "5  Joan \"the bone\", Anne       Jet               9th, at Terrace plc   \n",
       "\n",
       "          City State  Area Code  \n",
       "0    Riverside    NJ       8075  \n",
       "1        Phila    PA       9119  \n",
       "2    Riverside    NJ       8075  \n",
       "3     SomeTown    SD      91234  \n",
       "4     SomeTown    SD        298  \n",
       "5  Desert City    CO        123  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# No, so there must be some other kind of workaround.\n",
    "# Yes. read_csv lets us specify:\n",
    "df2 = pd.read_csv(url, header=None, names=['First Name', 'Last Name', 'Location ', 'City','State','Area Code'])\n",
    "df2\n",
    "# see https://pandas.pydata.org/docs/user_guide/io.html?highlight=pandas%20csv_read#io-read-csv-table"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# append would work if we looped through, line by line; see below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Selecting a single column\n",
    "\n",
    "To select the first column 'First Name', you can pass the column name as a string to the indexing operator.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0                     Jack\n",
       "1            John \"Da Man\"\n",
       "2                  Stephen\n",
       "3                      NaN\n",
       "4    Joan \"the bone\", Anne\n",
       "Name: First Name, dtype: object"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"First Name\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Selecting multiple columns\n",
    "\n",
    "To select multiple columns, you can pass a list of column names to the indexing operator.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Location</th>\n",
       "      <th>City</th>\n",
       "      <th>State</th>\n",
       "      <th>Area Code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Jack</td>\n",
       "      <td>McGinnis</td>\n",
       "      <td>220 hobo Av.</td>\n",
       "      <td>Phila</td>\n",
       "      <td>PA</td>\n",
       "      <td>9119</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>John \"Da Man\"</td>\n",
       "      <td>Repici</td>\n",
       "      <td>120 Jefferson St.</td>\n",
       "      <td>Riverside</td>\n",
       "      <td>NJ</td>\n",
       "      <td>8075</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Stephen</td>\n",
       "      <td>Tyler</td>\n",
       "      <td>7452 Terrace \"At the Plaza\" road</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>91234</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>Blankman</td>\n",
       "      <td>NaN</td>\n",
       "      <td>SomeTown</td>\n",
       "      <td>SD</td>\n",
       "      <td>298</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Joan \"the bone\", Anne</td>\n",
       "      <td>Jet</td>\n",
       "      <td>9th, at Terrace plc</td>\n",
       "      <td>Desert City</td>\n",
       "      <td>CO</td>\n",
       "      <td>123</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              First Name Last Name                         Location   \\\n",
       "0                   Jack  McGinnis                      220 hobo Av.   \n",
       "1          John \"Da Man\"    Repici                 120 Jefferson St.   \n",
       "2                Stephen     Tyler  7452 Terrace \"At the Plaza\" road   \n",
       "3                    NaN  Blankman                               NaN   \n",
       "4  Joan \"the bone\", Anne       Jet               9th, at Terrace plc   \n",
       "\n",
       "          City State  Area Code  \n",
       "0        Phila    PA       9119  \n",
       "1    Riverside    NJ       8075  \n",
       "2     SomeTown    SD      91234  \n",
       "3     SomeTown    SD        298  \n",
       "4  Desert City    CO        123  "
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = df[['First Name', 'Last Name', 'Location ', 'City','State','Area Code']]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Selecting rows using .iloc and .loc\n",
    "\n",
    "Now, let's see how to use .loc for selecting rows from our DataFrame.\n",
    "\n",
    "**loc() : loc() is label based data selecting method which means that we have to pass the name of the row or column which we want to select.**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "First Name            Jack\n",
       "Last Name         McGinnis\n",
       "Location      220 hobo Av.\n",
       "City                 Phila\n",
       "State                   PA\n",
       "Area Code             9119\n",
       "Name: 0, dtype: object"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# To select the first row\n",
    "df.loc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0             Jack\n",
       "1    John \"Da Man\"\n",
       "2          Stephen\n",
       "Name: First Name, dtype: object"
      ]
     },
     "execution_count": 29,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# To select the 0th,1st and 2nd row of \"First Name\" column only\n",
    "df.loc[[0,1,2], \"First Name\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, let's see how to use .iloc for selecting rows from our DataFrame.\n",
    "\n",
    "**iloc() : iloc() is a indexed based selecting method which means that we have to pass integer index in the method to select specific row/column.**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0             Jack\n",
       "1    John \"Da Man\"\n",
       "2          Stephen\n",
       "Name: First Name, dtype: object"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# To select the 0th,1st and 2nd row of \"First Name\" column only\n",
    "df.iloc[[0,1,2], 0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more information please read the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html?utm_medium=Exinfluencer\\&utm_source=Exinfluencer\\&utm_content=000026UJ\\&utm_term=10006555\\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01).\n",
    "\n",
    "Let perform some basic transformation in pandas.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Transform Function in Pandas\n",
    "\n",
    "Python’s Transform function returns a self-produced dataframe with transformed values after applying the function specified in its parameter.\n",
    "\n",
    "Let's see how Transform function works.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "#import library\n",
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>7</td>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   a  b  c\n",
       "0  1  2  3\n",
       "1  4  5  6\n",
       "2  7  8  9"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#creating a dataframe\n",
    "df=pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s say we want to add 10 to each element in a dataframe:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>11</td>\n",
       "      <td>12</td>\n",
       "      <td>13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "      <td>16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    a   b   c\n",
       "0  11  12  13\n",
       "1  14  15  16\n",
       "2  17  18  19"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#applying the transform function\n",
    "df = df.transform(func = lambda x : x + 10)\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will use DataFrame.transform() function to find the square root to each element of the dataframe.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = df.transform(func = ['sqrt'])\n",
    "# Omitting the square brackets still works, but doesn't show sqrt in the header row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>a</th>\n",
       "      <th>b</th>\n",
       "      <th>c</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>sqrt</th>\n",
       "      <th>sqrt</th>\n",
       "      <th>sqrt</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>3.316625</td>\n",
       "      <td>3.464102</td>\n",
       "      <td>3.605551</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3.741657</td>\n",
       "      <td>3.872983</td>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4.123106</td>\n",
       "      <td>4.242641</td>\n",
       "      <td>4.358899</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          a         b         c\n",
       "       sqrt      sqrt      sqrt\n",
       "0  3.316625  3.464102  3.605551\n",
       "1  3.741657  3.872983  4.000000\n",
       "2  4.123106  4.242641  4.358899"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more information about the **transform()** function  please read the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html?utm_medium=Exinfluencer\\&utm_source=Exinfluencer\\&utm_content=000026UJ\\&utm_term=10006555\\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# JSON file Format\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**JSON (JavaScript Object Notation)** is a lightweight data-interchange format. It is easy for humans to read and write.\n",
    "\n",
    "JSON is built on two structures:\n",
    "\n",
    "1.  A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.\n",
    "\n",
    "2.  An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.\n",
    "\n",
    "JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data. It is a very common data format with a diverse range of applications.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The text in JSON is done through quoted string which contains the values in key-value mappings within { }. It is similar to the dictionary in Python.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Python supports JSON through a built-in package called **json**. To use this feature, we import the json package in Python script.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Writing JSON to a File\n",
    "\n",
    "This is usually called **serialization**. It is the process of converting an object into a special format which is suitable for transmitting over the network or storing in file or database.\n",
    "\n",
    "To handle the data flow in a file, the JSON library in Python uses the **dump()** or **dumps()** function to convert the Python objects into their respective JSON object. This makes it easy to write data to files.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "person = {\n",
    "    'first_name' : 'Mark',\n",
    "    'last_name' : 'abc',\n",
    "    'age' : 27,\n",
    "    'address': {\n",
    "        \"streetAddress\": \"21 2nd Street\",\n",
    "        \"city\": \"New York\",\n",
    "        \"state\": \"NY\",\n",
    "        \"postalCode\": \"10021-3100\"\n",
    "    }\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### serialization using dump() function\n",
    "\n",
    "**json.dump()** method can be used for writing to JSON file.\n",
    "\n",
    "Syntax: json.dump(dict, file_pointer)\n",
    "\n",
    "Parameters:\n",
    "\n",
    "1.  **dictionary** – name of the dictionary which should be converted to JSON object.\n",
    "2.  **file pointer** – pointer of the file opened in write or append mode.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('person.json', 'w') as f:  # writing JSON object\n",
    "    json.dump(person, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### serialization using dumps() function\n",
    "\n",
    "**json.dumps()** that helps in converting a dictionary to a JSON object.\n",
    "\n",
    "It takes two parameters:\n",
    "\n",
    "1.  **dictionary** – name of the dictionary which should be converted to JSON object.\n",
    "2.  **indent** – defines the number of units for indentation\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Serializing json  \n",
    "json_object = json.dumps(person, indent = 4) \n",
    "  \n",
    "# Writing to sample.json \n",
    "with open(\"sample.json\", \"w\") as outfile: \n",
    "    outfile.write(json_object) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "    \"first_name\": \"Mark\",\n",
      "    \"last_name\": \"abc\",\n",
      "    \"age\": 27,\n",
      "    \"address\": {\n",
      "        \"streetAddress\": \"21 2nd Street\",\n",
      "        \"city\": \"New York\",\n",
      "        \"state\": \"NY\",\n",
      "        \"postalCode\": \"10021-3100\"\n",
      "    }\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "print(json_object)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# dumps writes to a string.\n",
    "# dump writes to a file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our Python objects are now serialized to the file. To deserialize it back to the Python object, we use the load() function.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Reading JSON to a File\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This process is usually called **Deserialization** - it is the reverse of serialization. It converts the special format returned by the serialization back into a usable object.\n",
    "\n",
    "### Using json.load()\n",
    "\n",
    "The JSON package has json.load() function that loads the json content from a json file into a dictionary.\n",
    "\n",
    "It takes one parameter:\n",
    "\n",
    "File pointer: A file pointer that points to a JSON file.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'first_name': 'Mark', 'last_name': 'abc', 'age': 27, 'address': {'streetAddress': '21 2nd Street', 'city': 'New York', 'state': 'NY', 'postalCode': '10021-3100'}}\n",
      "<class 'dict'>\n"
     ]
    }
   ],
   "source": [
    "import json \n",
    "  \n",
    "# Opening JSON file \n",
    "with open('sample.json', 'r') as openfile: \n",
    "  \n",
    "    # Reading from json file \n",
    "    json_object = json.load(openfile) \n",
    "  \n",
    "print(json_object) \n",
    "print(type(json_object)) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XLSX file format\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**XLSX** is a Microsoft Excel Open XML file format. It is another type of Spreadsheet file format.\n",
    "\n",
    "In XLSX data is organized under the cells and columns in a sheet.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reading the data from XLSX file\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let’s load the data from XLSX file and define the sheet name. For loading the data you can use the Pandas library in python.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "\n",
    "urllib.request.urlretrieve(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/file_example_XLSX_10.xlsx\", \"sample.xlsx\")\n",
    "df = pd.read_excel(\"sample.xlsx\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>First Name</th>\n",
       "      <th>Last Name</th>\n",
       "      <th>Gender</th>\n",
       "      <th>Country</th>\n",
       "      <th>Age</th>\n",
       "      <th>Date</th>\n",
       "      <th>Id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Dulce</td>\n",
       "      <td>Abril</td>\n",
       "      <td>Female</td>\n",
       "      <td>United States</td>\n",
       "      <td>32</td>\n",
       "      <td>15/10/2017</td>\n",
       "      <td>1562</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Mara</td>\n",
       "      <td>Hashimoto</td>\n",
       "      <td>Female</td>\n",
       "      <td>Great Britain</td>\n",
       "      <td>25</td>\n",
       "      <td>16/08/2016</td>\n",
       "      <td>1582</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Philip</td>\n",
       "      <td>Gent</td>\n",
       "      <td>Male</td>\n",
       "      <td>France</td>\n",
       "      <td>36</td>\n",
       "      <td>21/05/2015</td>\n",
       "      <td>2587</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Kathleen</td>\n",
       "      <td>Hanner</td>\n",
       "      <td>Female</td>\n",
       "      <td>United States</td>\n",
       "      <td>25</td>\n",
       "      <td>15/10/2017</td>\n",
       "      <td>3549</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Nereida</td>\n",
       "      <td>Magwood</td>\n",
       "      <td>Female</td>\n",
       "      <td>United States</td>\n",
       "      <td>58</td>\n",
       "      <td>16/08/2016</td>\n",
       "      <td>2468</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>Gaston</td>\n",
       "      <td>Brumm</td>\n",
       "      <td>Male</td>\n",
       "      <td>United States</td>\n",
       "      <td>24</td>\n",
       "      <td>21/05/2015</td>\n",
       "      <td>2554</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>7</td>\n",
       "      <td>Etta</td>\n",
       "      <td>Hurn</td>\n",
       "      <td>Female</td>\n",
       "      <td>Great Britain</td>\n",
       "      <td>56</td>\n",
       "      <td>15/10/2017</td>\n",
       "      <td>3598</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8</td>\n",
       "      <td>Earlean</td>\n",
       "      <td>Melgar</td>\n",
       "      <td>Female</td>\n",
       "      <td>United States</td>\n",
       "      <td>27</td>\n",
       "      <td>16/08/2016</td>\n",
       "      <td>2456</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>9</td>\n",
       "      <td>Vincenza</td>\n",
       "      <td>Weiland</td>\n",
       "      <td>Female</td>\n",
       "      <td>United States</td>\n",
       "      <td>40</td>\n",
       "      <td>21/05/2015</td>\n",
       "      <td>6548</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0 First Name  Last Name  Gender        Country  Age        Date    Id\n",
       "0  1      Dulce      Abril  Female  United States   32  15/10/2017  1562\n",
       "1  2       Mara  Hashimoto  Female  Great Britain   25  16/08/2016  1582\n",
       "2  3     Philip       Gent    Male         France   36  21/05/2015  2587\n",
       "3  4   Kathleen     Hanner  Female  United States   25  15/10/2017  3549\n",
       "4  5    Nereida    Magwood  Female  United States   58  16/08/2016  2468\n",
       "5  6     Gaston      Brumm    Male  United States   24  21/05/2015  2554\n",
       "6  7       Etta       Hurn  Female  Great Britain   56  15/10/2017  3598\n",
       "7  8    Earlean     Melgar  Female  United States   27  16/08/2016  2456\n",
       "8  9   Vincenza    Weiland  Female  United States   40  21/05/2015  6548"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# XML file format\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**XML is also known as Extensible Markup Language**. As the name suggests, it is a markup language. It has certain rules for encoding data. XML file format is a human-readable and machine-readable file format.\n",
    "\n",
    "Pandas does not include any methods to read and write XML files. Here, we will take a look at how we can use other modules to read data from an XML file, and load it into a Pandas DataFrame.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Writing with xml.etree.ElementTree\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The **xml.etree.ElementTree** module comes built-in with Python. It provides functionality for parsing and creating XML documents. ElementTree represents the XML document as a tree. We can move across the document using nodes which are elements and sub-elements of the XML file.\n",
    "\n",
    "For more information please read the [xml.etree.ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html?utm_medium=Exinfluencer\\&utm_source=Exinfluencer\\&utm_content=000026UJ\\&utm_term=10006555\\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01) documentation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# create the file structure\n",
    "employee = ET.Element('employee')\n",
    "details = ET.SubElement(employee, 'details')\n",
    "first = ET.SubElement(details, 'firstname')\n",
    "second = ET.SubElement(details, 'lastname')\n",
    "third = ET.SubElement(details, 'age')\n",
    "first.text = 'Shiv'\n",
    "second.text = 'Mishra'\n",
    "third.text = '23'\n",
    "\n",
    "# create a new XML file with the results\n",
    "mydata1 = ET.ElementTree(employee)\n",
    "# myfile = open(\"items2.xml\", \"wb\")\n",
    "# myfile.write(mydata)\n",
    "with open(\"new_sample.xml\", \"wb\") as files:\n",
    "    mydata1.write(files)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<xml.etree.ElementTree.ElementTree at 0x7f63a3d6c358>"
      ]
     },
     "execution_count": 57,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mydata1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reading with xml.etree.ElementTree\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's have a look at a one way to read XML data and put it in a Pandas DataFrame. You can see the XML file in the Notepad of your local machine.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2021-06-30 14:55:25--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/Sample-employee-XML-file.xml\n",
      "Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n",
      "Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 1016 [application/xml]\n",
      "Saving to: ‘Sample-employee-XML-file.xml’\n",
      "\n",
      "Sample-employee-XML 100%[===================>]    1016  --.-KB/s    in 0s      \n",
      "\n",
      "2021-06-30 14:55:25 (24.2 MB/s) - ‘Sample-employee-XML-file.xml’ saved [1016/1016]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd \n",
    "\n",
    "import xml.etree.ElementTree as etree\n",
    "\n",
    "!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/Sample-employee-XML-file.xml"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You would need to firstly parse an XML file and create a list of columns for data frame, then extract useful information from the XML file and add to a pandas data frame.\n",
    "\n",
    "Here is a sample code that you can use.:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree = etree.parse(\"Sample-employee-XML-file.xml\")\n",
    "\n",
    "root = tree.getroot()\n",
    "columns = [\"firstname\", \"lastname\", \"title\", \"division\", \"building\",\"room\"]\n",
    "\n",
    "datatframe = pd.DataFrame(columns = columns)\n",
    "\n",
    "for node in root: \n",
    "\n",
    "    firstname = node.find(\"firstname\").text\n",
    "\n",
    "    lastname = node.find(\"lastname\").text \n",
    "\n",
    "    title = node.find(\"title\").text \n",
    "    \n",
    "    division = node.find(\"division\").text \n",
    "    \n",
    "    building = node.find(\"building\").text\n",
    "    \n",
    "    room = node.find(\"room\").text\n",
    "    \n",
    "    datatframe = datatframe.append(pd.Series([firstname, lastname, title, division, building, room], index = columns), ignore_index = True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(root)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 60,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>firstname</th>\n",
       "      <th>lastname</th>\n",
       "      <th>title</th>\n",
       "      <th>division</th>\n",
       "      <th>building</th>\n",
       "      <th>room</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Shiv</td>\n",
       "      <td>Mishra</td>\n",
       "      <td>Engineer</td>\n",
       "      <td>Computer</td>\n",
       "      <td>301</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Yuh</td>\n",
       "      <td>Datta</td>\n",
       "      <td>developer</td>\n",
       "      <td>Computer</td>\n",
       "      <td>303</td>\n",
       "      <td>02</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Rahil</td>\n",
       "      <td>Khan</td>\n",
       "      <td>Tester</td>\n",
       "      <td>Computer</td>\n",
       "      <td>304</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Deep</td>\n",
       "      <td>Parekh</td>\n",
       "      <td>Designer</td>\n",
       "      <td>Computer</td>\n",
       "      <td>305</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  firstname lastname      title  division building room\n",
       "0      Shiv   Mishra   Engineer  Computer      301   11\n",
       "1       Yuh    Datta  developer  Computer      303   02\n",
       "2     Rahil     Khan     Tester  Computer      304   10\n",
       "3      Deep   Parekh   Designer  Computer      305   14"
      ]
     },
     "execution_count": 60,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "datatframe"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save Data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Correspondingly, Pandas enables us to save the dataset to csv by using the **dataframe.to_csv()** method, you can add the file path and name along with quotation marks in the parentheses.\n",
    "\n",
    "For example, if you would save the dataframe df as **employee.csv** to your local machine, you may use the syntax below:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 63,
   "metadata": {},
   "outputs": [],
   "source": [
    "datatframe.to_csv(\"employee.csv\", index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also read and save other file formats, we can use similar functions to **`pd.read_csv()`** and **`df.to_csv()`** for other data formats. The functions are listed in the following table:\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h2>Read/Save Other Data Formats</h2>\n",
    "\n",
    "| Data Formate  | Read           | Save             |\n",
    "| ------------- |:--------------:| ----------------:|\n",
    "| csv           | `pd.read_csv()`  |`df.to_csv()`     |\n",
    "| json          | `pd.read_json()` |`df.to_json()`    |\n",
    "| excel         | `pd.read_excel()`|`df.to_excel()`   |\n",
    "| hdf           | `pd.read_hdf()`  |`df.to_hdf()`     |\n",
    "| sql           | `pd.read_sql()`  |`df.to_sql()`     |\n",
    "| ...           |   ...          |       ...        |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's move ahead and perform some **Data Analysis**.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Binary File Format\n",
    "\n",
    "\"Binary\" files are any files where the format isn't made up of readable characters. It contain formatting information that only certain applications or processors can understand. While humans can read text files, binary files must be run on the appropriate software or processor before humans can read them.\n",
    "\n",
    "Binary files can range from image files like JPEGs or GIFs, audio files like MP3s or binary document formats like Word or PDF.\n",
    "\n",
    "Let's see how to read an **Image** file.\n",
    "\n",
    "## Reading the Image file\n",
    "\n",
    "Python supports very powerful tools when it comes to image processing. Let’s see how to process the images using the **PIL** library.\n",
    "\n",
    "PIL is the Python Imaging Library which provides the python interpreter with image editing capabilities.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('dog.jpg', <http.client.HTTPMessage at 0x7f63a3c84978>)"
      ]
     },
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# importing PIL \n",
    "from PIL import Image \n",
    "\n",
    "import urllib.request\n",
    "# Downloading dataset\n",
    "urllib.request.urlretrieve(\"https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg\", \"dog.jpg\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
No results found