Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Select an option

  • Save oC-n/80b928641ca1d4f052d3870ee271cc61 to your computer and use it in GitHub Desktop.

Select an option

Save oC-n/80b928641ca1d4f052d3870ee271cc61 to your computer and use it in GitHub Desktop.
Python for Data Science, AI & Development - Week 5 - Hands-on Lab: Working with different file formats
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>\n",
" <img src=\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/IDSNlogo.png\" width=\"300\" alt=\"cognitiveclass.ai logo\" />\n",
"</center>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Table of Contents\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
"\n",
"<font size = 3>\n",
"\n",
"1. <a href=\"#item31\">Data Engineering</a>\n",
"2. <a href=\"#item31\">Data Engineering Process</a>\n",
"3. <a href=\"#item32\">Working with different file formats</a>\n",
"4. <a href=\"#item33\">Data Analysis</a>\n",
"\n",
"</font>\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Engineering\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Data engineering** is one of the most critical and foundational skills in any data scientist’s toolkit.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Engineering Process\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are several steps in Data Engineering process.\n",
"\n",
"1. **Extract** - Data extraction is getting data from multiple sources. Ex. Data extraction from a website using Web scraping or gathering information from the data that are stored in different formats(JSON, CSV, XLSX etc.).\n",
"\n",
"2. **Transform** - Tarnsforming the data means removing the data that we don't need for further analysis and converting the data in the format that all the data from the multiple sources is in the same format.\n",
"\n",
"3. **Load** - Loading the data inside a data warehouse. Data warehouse essentially contains large volumes of data that are accessed to gather insights.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Working with different file formats\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the real-world, people rarely get neat tabular data. Thus, it is mandatory for any data scientist (or data engineer) to be aware of different file formats, common challenges in handling them and the best, most efficient ways to handle this data in real life. We have reviewed some of this content in other modules.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### File Format\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A file format is a standard way in which information is encoded for storage in a file. First, the file format specifies whether the file is a binary or ASCII file. Second, it shows how the information is organized. For example, the comma-separated values (CSV) file format stores tabular data in plain text.\n",
"\n",
"To identify a file format, you can usually look at the file extension to get an idea. For example, a file saved with name “Data” in “CSV” format will appear as “Data.csv”. By noticing the “.csv” extension, we can clearly identify that it is a “CSV” file and the data is stored in a tabular format.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are various formats for a dataset, .csv, .json, .xlsx etc. The dataset can be stored in different places, on your local machine or sometimes online.\n",
"\n",
"**In this section, you will learn how to load a dataset into our Jupyter Notebook.**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we will look at some file formats and how to read them in Python:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Comma-separated values (CSV) file format\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **Comma-separated values** file format falls under a spreadsheet file format.\n",
"\n",
"In a spreadsheet file format, data is stored in cells. Each cell is organized in rows and columns. A column in the spreadsheet file can have different types. For example, a column can be of string type, a date type, or an integer type.\n",
"\n",
"Each line in CSV file represents an observation, or commonly called a record. Each record may contain one or more fields which are separated by a comma.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading data from CSV in Python\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **Pandas** Library is a useful tool that enables us to read various datasets into a Pandas data frame\n",
"\n",
"Let us look at how to read a CSV file in Pandas Library.\n",
"\n",
"We use **pandas.read_csv()** function to read the csv file. In the parentheses, we put the file path along with a quotation mark as an argument, so that pandas will read the file into a data frame from that address. The file path can be either a URL or your local file address.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"url ='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/addresses.csv'\n",
"df = pd.read_csv(url)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>John</th>\n",
" <th>Doe</th>\n",
" <th>120 jefferson st.</th>\n",
" <th>Riverside</th>\n",
" <th>NJ</th>\n",
" <th>08075</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Jack</td>\n",
" <td>McGinnis</td>\n",
" <td>220 hobo Av.</td>\n",
" <td>Phila</td>\n",
" <td>PA</td>\n",
" <td>9119</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>John \"Da Man\"</td>\n",
" <td>Repici</td>\n",
" <td>120 Jefferson St.</td>\n",
" <td>Riverside</td>\n",
" <td>NJ</td>\n",
" <td>8075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Stephen</td>\n",
" <td>Tyler</td>\n",
" <td>7452 Terrace \"At the Plaza\" road</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>91234</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>Blankman</td>\n",
" <td>NaN</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>298</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Joan \"the bone\", Anne</td>\n",
" <td>Jet</td>\n",
" <td>9th, at Terrace plc</td>\n",
" <td>Desert City</td>\n",
" <td>CO</td>\n",
" <td>123</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" John Doe 120 jefferson st. \\\n",
"0 Jack McGinnis 220 hobo Av. \n",
"1 John \"Da Man\" Repici 120 Jefferson St. \n",
"2 Stephen Tyler 7452 Terrace \"At the Plaza\" road \n",
"3 NaN Blankman NaN \n",
"4 Joan \"the bone\", Anne Jet 9th, at Terrace plc \n",
"\n",
" Riverside NJ 08075 \n",
"0 Phila PA 9119 \n",
"1 Riverside NJ 8075 \n",
"2 SomeTown SD 91234 \n",
"3 SomeTown SD 298 \n",
"4 Desert City CO 123 "
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Adding column name to the DataFrame\n",
"\n",
"We can add columns to an existing DataFrame using its **columns** attribute.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"df.columns =['First Name', 'Last Name', 'Location ', 'City','State','Area Code']"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>First Name</th>\n",
" <th>Last Name</th>\n",
" <th>Location</th>\n",
" <th>City</th>\n",
" <th>State</th>\n",
" <th>Area Code</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Jack</td>\n",
" <td>McGinnis</td>\n",
" <td>220 hobo Av.</td>\n",
" <td>Phila</td>\n",
" <td>PA</td>\n",
" <td>9119</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>John \"Da Man\"</td>\n",
" <td>Repici</td>\n",
" <td>120 Jefferson St.</td>\n",
" <td>Riverside</td>\n",
" <td>NJ</td>\n",
" <td>8075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Stephen</td>\n",
" <td>Tyler</td>\n",
" <td>7452 Terrace \"At the Plaza\" road</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>91234</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>Blankman</td>\n",
" <td>NaN</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>298</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Joan \"the bone\", Anne</td>\n",
" <td>Jet</td>\n",
" <td>9th, at Terrace plc</td>\n",
" <td>Desert City</td>\n",
" <td>CO</td>\n",
" <td>123</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" First Name Last Name Location \\\n",
"0 Jack McGinnis 220 hobo Av. \n",
"1 John \"Da Man\" Repici 120 Jefferson St. \n",
"2 Stephen Tyler 7452 Terrace \"At the Plaza\" road \n",
"3 NaN Blankman NaN \n",
"4 Joan \"the bone\", Anne Jet 9th, at Terrace plc \n",
"\n",
" City State Area Code \n",
"0 Phila PA 9119 \n",
"1 Riverside NJ 8075 \n",
"2 SomeTown SD 91234 \n",
"3 SomeTown SD 298 \n",
"4 Desert City CO 123 "
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>First Name</th>\n",
" <th>Last Name</th>\n",
" <th>Location</th>\n",
" <th>City</th>\n",
" <th>State</th>\n",
" <th>Area Code</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"Empty DataFrame\n",
"Columns: [First Name, Last Name, Location , City, State, Area Code]\n",
"Index: []"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# We see here that the assumed header has been replaced with the one specified, but this means we lose the first line of data.\n",
"# Let's see whether pre-defining the dataframe columns would have avoided this.\n",
"df1 = pd.DataFrame(columns = ['First Name', 'Last Name', 'Location ', 'City','State','Area Code'])\n",
"df1.append(pd.read_csv(url))\n",
"df1"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>First Name</th>\n",
" <th>Last Name</th>\n",
" <th>Location</th>\n",
" <th>City</th>\n",
" <th>State</th>\n",
" <th>Area Code</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>John</td>\n",
" <td>Doe</td>\n",
" <td>120 jefferson st.</td>\n",
" <td>Riverside</td>\n",
" <td>NJ</td>\n",
" <td>8075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Jack</td>\n",
" <td>McGinnis</td>\n",
" <td>220 hobo Av.</td>\n",
" <td>Phila</td>\n",
" <td>PA</td>\n",
" <td>9119</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>John \"Da Man\"</td>\n",
" <td>Repici</td>\n",
" <td>120 Jefferson St.</td>\n",
" <td>Riverside</td>\n",
" <td>NJ</td>\n",
" <td>8075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Stephen</td>\n",
" <td>Tyler</td>\n",
" <td>7452 Terrace \"At the Plaza\" road</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>91234</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>Blankman</td>\n",
" <td>NaN</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>298</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Joan \"the bone\", Anne</td>\n",
" <td>Jet</td>\n",
" <td>9th, at Terrace plc</td>\n",
" <td>Desert City</td>\n",
" <td>CO</td>\n",
" <td>123</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" First Name Last Name Location \\\n",
"0 John Doe 120 jefferson st. \n",
"1 Jack McGinnis 220 hobo Av. \n",
"2 John \"Da Man\" Repici 120 Jefferson St. \n",
"3 Stephen Tyler 7452 Terrace \"At the Plaza\" road \n",
"4 NaN Blankman NaN \n",
"5 Joan \"the bone\", Anne Jet 9th, at Terrace plc \n",
"\n",
" City State Area Code \n",
"0 Riverside NJ 8075 \n",
"1 Phila PA 9119 \n",
"2 Riverside NJ 8075 \n",
"3 SomeTown SD 91234 \n",
"4 SomeTown SD 298 \n",
"5 Desert City CO 123 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# No, so there must be some other kind of workaround.\n",
"# Yes. read_csv lets us specify:\n",
"df2 = pd.read_csv(url, header=None, names=['First Name', 'Last Name', 'Location ', 'City','State','Area Code'])\n",
"df2\n",
"# see https://pandas.pydata.org/docs/user_guide/io.html?highlight=pandas%20csv_read#io-read-csv-table"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# append would work if we looped through, line by line; see below."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Selecting a single column\n",
"\n",
"To select the first column 'First Name', you can pass the column name as a string to the indexing operator.\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Jack\n",
"1 John \"Da Man\"\n",
"2 Stephen\n",
"3 NaN\n",
"4 Joan \"the bone\", Anne\n",
"Name: First Name, dtype: object"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"First Name\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Selecting multiple columns\n",
"\n",
"To select multiple columns, you can pass a list of column names to the indexing operator.\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>First Name</th>\n",
" <th>Last Name</th>\n",
" <th>Location</th>\n",
" <th>City</th>\n",
" <th>State</th>\n",
" <th>Area Code</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Jack</td>\n",
" <td>McGinnis</td>\n",
" <td>220 hobo Av.</td>\n",
" <td>Phila</td>\n",
" <td>PA</td>\n",
" <td>9119</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>John \"Da Man\"</td>\n",
" <td>Repici</td>\n",
" <td>120 Jefferson St.</td>\n",
" <td>Riverside</td>\n",
" <td>NJ</td>\n",
" <td>8075</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Stephen</td>\n",
" <td>Tyler</td>\n",
" <td>7452 Terrace \"At the Plaza\" road</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>91234</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>Blankman</td>\n",
" <td>NaN</td>\n",
" <td>SomeTown</td>\n",
" <td>SD</td>\n",
" <td>298</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Joan \"the bone\", Anne</td>\n",
" <td>Jet</td>\n",
" <td>9th, at Terrace plc</td>\n",
" <td>Desert City</td>\n",
" <td>CO</td>\n",
" <td>123</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" First Name Last Name Location \\\n",
"0 Jack McGinnis 220 hobo Av. \n",
"1 John \"Da Man\" Repici 120 Jefferson St. \n",
"2 Stephen Tyler 7452 Terrace \"At the Plaza\" road \n",
"3 NaN Blankman NaN \n",
"4 Joan \"the bone\", Anne Jet 9th, at Terrace plc \n",
"\n",
" City State Area Code \n",
"0 Phila PA 9119 \n",
"1 Riverside NJ 8075 \n",
"2 SomeTown SD 91234 \n",
"3 SomeTown SD 298 \n",
"4 Desert City CO 123 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = df[['First Name', 'Last Name', 'Location ', 'City','State','Area Code']]\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Selecting rows using .iloc and .loc\n",
"\n",
"Now, let's see how to use .loc for selecting rows from our DataFrame.\n",
"\n",
"**loc() : loc() is label based data selecting method which means that we have to pass the name of the row or column which we want to select.**\n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"First Name Jack\n",
"Last Name McGinnis\n",
"Location 220 hobo Av.\n",
"City Phila\n",
"State PA\n",
"Area Code 9119\n",
"Name: 0, dtype: object"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# To select the first row\n",
"df.loc[0]"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Jack\n",
"1 John \"Da Man\"\n",
"2 Stephen\n",
"Name: First Name, dtype: object"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# To select the 0th,1st and 2nd row of \"First Name\" column only\n",
"df.loc[[0,1,2], \"First Name\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's see how to use .iloc for selecting rows from our DataFrame.\n",
"\n",
"**iloc() : iloc() is a indexed based selecting method which means that we have to pass integer index in the method to select specific row/column.**\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Jack\n",
"1 John \"Da Man\"\n",
"2 Stephen\n",
"Name: First Name, dtype: object"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# To select the 0th,1st and 2nd row of \"First Name\" column only\n",
"df.iloc[[0,1,2], 0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information please read the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html?utm_medium=Exinfluencer\\&utm_source=Exinfluencer\\&utm_content=000026UJ\\&utm_term=10006555\\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01).\n",
"\n",
"Let perform some basic transformation in pandas.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Transform Function in Pandas\n",
"\n",
"Python’s Transform function returns a self-produced dataframe with transformed values after applying the function specified in its parameter.\n",
"\n",
"Let's see how Transform function works.\n"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"#import library\n",
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>7</td>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c\n",
"0 1 2 3\n",
"1 4 5 6\n",
"2 7 8 9"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#creating a dataframe\n",
"df=pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s say we want to add 10 to each element in a dataframe:\n"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>11</td>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" <td>16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>17</td>\n",
" <td>18</td>\n",
" <td>19</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c\n",
"0 11 12 13\n",
"1 14 15 16\n",
"2 17 18 19"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#applying the transform function\n",
"df = df.transform(func = lambda x : x + 10)\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we will use DataFrame.transform() function to find the square root to each element of the dataframe.\n"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"result = df.transform(func = ['sqrt'])\n",
"# Omitting the square brackets still works, but doesn't show sqrt in the header row."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>a</th>\n",
" <th>b</th>\n",
" <th>c</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>sqrt</th>\n",
" <th>sqrt</th>\n",
" <th>sqrt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.316625</td>\n",
" <td>3.464102</td>\n",
" <td>3.605551</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.741657</td>\n",
" <td>3.872983</td>\n",
" <td>4.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.123106</td>\n",
" <td>4.242641</td>\n",
" <td>4.358899</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" a b c\n",
" sqrt sqrt sqrt\n",
"0 3.316625 3.464102 3.605551\n",
"1 3.741657 3.872983 4.000000\n",
"2 4.123106 4.242641 4.358899"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more information about the **transform()** function please read the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html?utm_medium=Exinfluencer\\&utm_source=Exinfluencer\\&utm_content=000026UJ\\&utm_term=10006555\\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# JSON file Format\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**JSON (JavaScript Object Notation)** is a lightweight data-interchange format. It is easy for humans to read and write.\n",
"\n",
"JSON is built on two structures:\n",
"\n",
"1. A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.\n",
"\n",
"2. An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.\n",
"\n",
"JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data. It is a very common data format with a diverse range of applications.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The text in JSON is done through quoted string which contains the values in key-value mappings within { }. It is similar to the dictionary in Python.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python supports JSON through a built-in package called **json**. To use this feature, we import the json package in Python script.\n"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"import json"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Writing JSON to a File\n",
"\n",
"This is usually called **serialization**. It is the process of converting an object into a special format which is suitable for transmitting over the network or storing in file or database.\n",
"\n",
"To handle the data flow in a file, the JSON library in Python uses the **dump()** or **dumps()** function to convert the Python objects into their respective JSON object. This makes it easy to write data to files.\n"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"person = {\n",
" 'first_name' : 'Mark',\n",
" 'last_name' : 'abc',\n",
" 'age' : 27,\n",
" 'address': {\n",
" \"streetAddress\": \"21 2nd Street\",\n",
" \"city\": \"New York\",\n",
" \"state\": \"NY\",\n",
" \"postalCode\": \"10021-3100\"\n",
" }\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### serialization using dump() function\n",
"\n",
"**json.dump()** method can be used for writing to JSON file.\n",
"\n",
"Syntax: json.dump(dict, file_pointer)\n",
"\n",
"Parameters:\n",
"\n",
"1. **dictionary** – name of the dictionary which should be converted to JSON object.\n",
"2. **file pointer** – pointer of the file opened in write or append mode.\n"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"with open('person.json', 'w') as f: # writing JSON object\n",
" json.dump(person, f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### serialization using dumps() function\n",
"\n",
"**json.dumps()** that helps in converting a dictionary to a JSON object.\n",
"\n",
"It takes two parameters:\n",
"\n",
"1. **dictionary** – name of the dictionary which should be converted to JSON object.\n",
"2. **indent** – defines the number of units for indentation\n"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# Serializing json \n",
"json_object = json.dumps(person, indent = 4) \n",
" \n",
"# Writing to sample.json \n",
"with open(\"sample.json\", \"w\") as outfile: \n",
" outfile.write(json_object) "
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{\n",
" \"first_name\": \"Mark\",\n",
" \"last_name\": \"abc\",\n",
" \"age\": 27,\n",
" \"address\": {\n",
" \"streetAddress\": \"21 2nd Street\",\n",
" \"city\": \"New York\",\n",
" \"state\": \"NY\",\n",
" \"postalCode\": \"10021-3100\"\n",
" }\n",
"}\n"
]
}
],
"source": [
"print(json_object)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# dumps writes to a string.\n",
"# dump writes to a file."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our Python objects are now serialized to the file. To deserialize it back to the Python object, we use the load() function.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Reading JSON to a File\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This process is usually called **Deserialization** - it is the reverse of serialization. It converts the special format returned by the serialization back into a usable object.\n",
"\n",
"### Using json.load()\n",
"\n",
"The JSON package has json.load() function that loads the json content from a json file into a dictionary.\n",
"\n",
"It takes one parameter:\n",
"\n",
"File pointer: A file pointer that points to a JSON file.\n"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'first_name': 'Mark', 'last_name': 'abc', 'age': 27, 'address': {'streetAddress': '21 2nd Street', 'city': 'New York', 'state': 'NY', 'postalCode': '10021-3100'}}\n",
"<class 'dict'>\n"
]
}
],
"source": [
"import json \n",
" \n",
"# Opening JSON file \n",
"with open('sample.json', 'r') as openfile: \n",
" \n",
" # Reading from json file \n",
" json_object = json.load(openfile) \n",
" \n",
"print(json_object) \n",
"print(type(json_object)) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XLSX file format\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**XLSX** is a Microsoft Excel Open XML file format. It is another type of Spreadsheet file format.\n",
"\n",
"In XLSX data is organized under the cells and columns in a sheet.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading the data from XLSX file\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let’s load the data from XLSX file and define the sheet name. For loading the data you can use the Pandas library in python.\n"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [],
"source": [
"import urllib.request\n",
"\n",
"urllib.request.urlretrieve(\"https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/file_example_XLSX_10.xlsx\", \"sample.xlsx\")\n",
"df = pd.read_excel(\"sample.xlsx\")"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>First Name</th>\n",
" <th>Last Name</th>\n",
" <th>Gender</th>\n",
" <th>Country</th>\n",
" <th>Age</th>\n",
" <th>Date</th>\n",
" <th>Id</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>Dulce</td>\n",
" <td>Abril</td>\n",
" <td>Female</td>\n",
" <td>United States</td>\n",
" <td>32</td>\n",
" <td>15/10/2017</td>\n",
" <td>1562</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>Mara</td>\n",
" <td>Hashimoto</td>\n",
" <td>Female</td>\n",
" <td>Great Britain</td>\n",
" <td>25</td>\n",
" <td>16/08/2016</td>\n",
" <td>1582</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>Philip</td>\n",
" <td>Gent</td>\n",
" <td>Male</td>\n",
" <td>France</td>\n",
" <td>36</td>\n",
" <td>21/05/2015</td>\n",
" <td>2587</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>Kathleen</td>\n",
" <td>Hanner</td>\n",
" <td>Female</td>\n",
" <td>United States</td>\n",
" <td>25</td>\n",
" <td>15/10/2017</td>\n",
" <td>3549</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>Nereida</td>\n",
" <td>Magwood</td>\n",
" <td>Female</td>\n",
" <td>United States</td>\n",
" <td>58</td>\n",
" <td>16/08/2016</td>\n",
" <td>2468</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>6</td>\n",
" <td>Gaston</td>\n",
" <td>Brumm</td>\n",
" <td>Male</td>\n",
" <td>United States</td>\n",
" <td>24</td>\n",
" <td>21/05/2015</td>\n",
" <td>2554</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>Etta</td>\n",
" <td>Hurn</td>\n",
" <td>Female</td>\n",
" <td>Great Britain</td>\n",
" <td>56</td>\n",
" <td>15/10/2017</td>\n",
" <td>3598</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>8</td>\n",
" <td>Earlean</td>\n",
" <td>Melgar</td>\n",
" <td>Female</td>\n",
" <td>United States</td>\n",
" <td>27</td>\n",
" <td>16/08/2016</td>\n",
" <td>2456</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>Vincenza</td>\n",
" <td>Weiland</td>\n",
" <td>Female</td>\n",
" <td>United States</td>\n",
" <td>40</td>\n",
" <td>21/05/2015</td>\n",
" <td>6548</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 First Name Last Name Gender Country Age Date Id\n",
"0 1 Dulce Abril Female United States 32 15/10/2017 1562\n",
"1 2 Mara Hashimoto Female Great Britain 25 16/08/2016 1582\n",
"2 3 Philip Gent Male France 36 21/05/2015 2587\n",
"3 4 Kathleen Hanner Female United States 25 15/10/2017 3549\n",
"4 5 Nereida Magwood Female United States 58 16/08/2016 2468\n",
"5 6 Gaston Brumm Male United States 24 21/05/2015 2554\n",
"6 7 Etta Hurn Female Great Britain 56 15/10/2017 3598\n",
"7 8 Earlean Melgar Female United States 27 16/08/2016 2456\n",
"8 9 Vincenza Weiland Female United States 40 21/05/2015 6548"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# XML file format\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**XML is also known as Extensible Markup Language**. As the name suggests, it is a markup language. It has certain rules for encoding data. XML file format is a human-readable and machine-readable file format.\n",
"\n",
"Pandas does not include any methods to read and write XML files. Here, we will take a look at how we can use other modules to read data from an XML file, and load it into a Pandas DataFrame.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Writing with xml.etree.ElementTree\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **xml.etree.ElementTree** module comes built-in with Python. It provides functionality for parsing and creating XML documents. ElementTree represents the XML document as a tree. We can move across the document using nodes which are elements and sub-elements of the XML file.\n",
"\n",
"For more information please read the [xml.etree.ElementTree](https://docs.python.org/3/library/xml.etree.elementtree.html?utm_medium=Exinfluencer\\&utm_source=Exinfluencer\\&utm_content=000026UJ\\&utm_term=10006555\\&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkPY0101ENSkillsNetwork19487395-2021-01-01) documentation.\n"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"import xml.etree.ElementTree as ET\n",
"\n",
"# create the file structure\n",
"employee = ET.Element('employee')\n",
"details = ET.SubElement(employee, 'details')\n",
"first = ET.SubElement(details, 'firstname')\n",
"second = ET.SubElement(details, 'lastname')\n",
"third = ET.SubElement(details, 'age')\n",
"first.text = 'Shiv'\n",
"second.text = 'Mishra'\n",
"third.text = '23'\n",
"\n",
"# create a new XML file with the results\n",
"mydata1 = ET.ElementTree(employee)\n",
"# myfile = open(\"items2.xml\", \"wb\")\n",
"# myfile.write(mydata)\n",
"with open(\"new_sample.xml\", \"wb\") as files:\n",
" mydata1.write(files)"
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<xml.etree.ElementTree.ElementTree at 0x7f63a3d6c358>"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mydata1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Reading with xml.etree.ElementTree\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's have a look at a one way to read XML data and put it in a Pandas DataFrame. You can see the XML file in the Notepad of your local machine.\n"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2021-06-30 14:55:25-- https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/Sample-employee-XML-file.xml\n",
"Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104\n",
"Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 1016 [application/xml]\n",
"Saving to: ‘Sample-employee-XML-file.xml’\n",
"\n",
"Sample-employee-XML 100%[===================>] 1016 --.-KB/s in 0s \n",
"\n",
"2021-06-30 14:55:25 (24.2 MB/s) - ‘Sample-employee-XML-file.xml’ saved [1016/1016]\n",
"\n"
]
}
],
"source": [
"import pandas as pd \n",
"\n",
"import xml.etree.ElementTree as etree\n",
"\n",
"!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%205/data/Sample-employee-XML-file.xml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You would need to firstly parse an XML file and create a list of columns for data frame, then extract useful information from the XML file and add to a pandas data frame.\n",
"\n",
"Here is a sample code that you can use.:\n"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {},
"outputs": [],
"source": [
"tree = etree.parse(\"Sample-employee-XML-file.xml\")\n",
"\n",
"root = tree.getroot()\n",
"columns = [\"firstname\", \"lastname\", \"title\", \"division\", \"building\",\"room\"]\n",
"\n",
"datatframe = pd.DataFrame(columns = columns)\n",
"\n",
"for node in root: \n",
"\n",
" firstname = node.find(\"firstname\").text\n",
"\n",
" lastname = node.find(\"lastname\").text \n",
"\n",
" title = node.find(\"title\").text \n",
" \n",
" division = node.find(\"division\").text \n",
" \n",
" building = node.find(\"building\").text\n",
" \n",
" room = node.find(\"room\").text\n",
" \n",
" datatframe = datatframe.append(pd.Series([firstname, lastname, title, division, building, room], index = columns), ignore_index = True)"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(root)"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>firstname</th>\n",
" <th>lastname</th>\n",
" <th>title</th>\n",
" <th>division</th>\n",
" <th>building</th>\n",
" <th>room</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Shiv</td>\n",
" <td>Mishra</td>\n",
" <td>Engineer</td>\n",
" <td>Computer</td>\n",
" <td>301</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Yuh</td>\n",
" <td>Datta</td>\n",
" <td>developer</td>\n",
" <td>Computer</td>\n",
" <td>303</td>\n",
" <td>02</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Rahil</td>\n",
" <td>Khan</td>\n",
" <td>Tester</td>\n",
" <td>Computer</td>\n",
" <td>304</td>\n",
" <td>10</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Deep</td>\n",
" <td>Parekh</td>\n",
" <td>Designer</td>\n",
" <td>Computer</td>\n",
" <td>305</td>\n",
" <td>14</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" firstname lastname title division building room\n",
"0 Shiv Mishra Engineer Computer 301 11\n",
"1 Yuh Datta developer Computer 303 02\n",
"2 Rahil Khan Tester Computer 304 10\n",
"3 Deep Parekh Designer Computer 305 14"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"datatframe"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Correspondingly, Pandas enables us to save the dataset to csv by using the **dataframe.to_csv()** method, you can add the file path and name along with quotation marks in the parentheses.\n",
"\n",
"For example, if you would save the dataframe df as **employee.csv** to your local machine, you may use the syntax below:\n"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {},
"outputs": [],
"source": [
"datatframe.to_csv(\"employee.csv\", index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also read and save other file formats, we can use similar functions to **`pd.read_csv()`** and **`df.to_csv()`** for other data formats. The functions are listed in the following table:\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Read/Save Other Data Formats</h2>\n",
"\n",
"| Data Formate | Read | Save |\n",
"| ------------- |:--------------:| ----------------:|\n",
"| csv | `pd.read_csv()` |`df.to_csv()` |\n",
"| json | `pd.read_json()` |`df.to_json()` |\n",
"| excel | `pd.read_excel()`|`df.to_excel()` |\n",
"| hdf | `pd.read_hdf()` |`df.to_hdf()` |\n",
"| sql | `pd.read_sql()` |`df.to_sql()` |\n",
"| ... | ... | ... |\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's move ahead and perform some **Data Analysis**.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Binary File Format\n",
"\n",
"\"Binary\" files are any files where the format isn't made up of readable characters. It contain formatting information that only certain applications or processors can understand. While humans can read text files, binary files must be run on the appropriate software or processor before humans can read them.\n",
"\n",
"Binary files can range from image files like JPEGs or GIFs, audio files like MP3s or binary document formats like Word or PDF.\n",
"\n",
"Let's see how to read an **Image** file.\n",
"\n",
"## Reading the Image file\n",
"\n",
"Python supports very powerful tools when it comes to image processing. Let’s see how to process the images using the **PIL** library.\n",
"\n",
"PIL is the Python Imaging Library which provides the python interpreter with image editing capabilities.\n"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('dog.jpg', <http.client.HTTPMessage at 0x7f63a3c84978>)"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# importing PIL \n",
"from PIL import Image \n",
"\n",
"import urllib.request\n",
"# Downloading dataset\n",
"urllib.request.urlretrieve(\"https://hips.hearstapps.com/hmg-prod.s3.amazonaws.com/images/dog-puppy-on-garden-royalty-free-image-1586966191.jpg\", \"dog.jpg\")"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [
{
"data": {
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment