Reading and Writing Files With Pandas
Watch Now: This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Reading and Writing Files With Pandas
Pandas is a powerful and flexible Python package that allows you to work with labeled and time series data. It also provides statistics methods, enables plotting, and more. One crucial feature of Pandas is its ability to write and read Excel, CSV, and many other types of files. Functions like the Pandas read_csv()
method enable you to work with files effectively. You can use them to save the data and labels from Pandas objects to a file and load them later as Pandas Series
or DataFrame
instances.
In this tutorial, you'll learn:
- What the Pandas IO tools API is
- How to read and write data to and from files
- How to work with various file formats
- How to work with big data efficiently
Let's start reading and writing files!
Installing Pandas
The code in this tutorial is executed with CPython 3.7.4 and Pandas 0.25.1. It would be beneficial to make sure you have the latest versions of Python and Pandas on your machine. You might want to create a new virtual environment and install the dependencies for this tutorial.
First, you'll need the Pandas library. You may already have it installed. If you don't, then you can install it with pip:
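$ pip install pandas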
Once the installation process completes, you should have Pandas installed and ready.
Anaconda is an excellent Python distribution that comes with Python, many useful packages like Pandas, and a package and environment manager called Conda. To learn more about Anaconda, check out Setting Up Python for Machine Learning on Windows.
If you don't have Pandas in your virtual environment, then you can install it with Conda:
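$ conda install pandas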
Conda is powerful as it manages the dependencies and their versions. To learn more about working with Conda, you can check out the official documentation.
Preparing Data
In this tutorial, you'll use data related to 20 countries. Here's an overview of the data and sources you'll be working with:
- Country is denoted by the country name. Each country is in the top 10 list for either population, area, or gross domestic product (GDP). The row labels for the dataset are the three-letter country codes defined in ISO 3166-1. The column label for the dataset is COUNTRY.
- Population is expressed in millions. The data comes from a list of countries and dependencies by population on Wikipedia. The column label for the dataset is POP.
- Area is expressed in thousands of kilometers squared. The data comes from a list of countries and dependencies by area on Wikipedia. The column label for the dataset is AREA.
- GDP is expressed in millions of U.S. dollars, according to the United Nations data for 2017. You can find this data in the list of countries by nominal GDP on Wikipedia. The column label for the dataset is GDP.
- Continent is either Africa, Asia, Oceania, Europe, North America, or South America. You can find this information on Wikipedia as well. The column label for the dataset is CONT.
- Independence day is a date that commemorates a nation's independence. The data comes from the list of national independence days on Wikipedia. The dates are shown in ISO 8601 format. The first four digits represent the year, the next two numbers are the month, and the last two are for the day of the month. The column label for the dataset is IND_DAY.
This is how the data looks as a table:
| | COUNTRY | POP | AREA | GDP | CONT | IND_DAY |
|---|---|---|---|---|---|---|
| CHN | China | 1398.72 | 9596.96 | 12234.78 | Asia | |
| IND | India | 1351.16 | 3287.26 | 2575.67 | Asia | 1947-08-15 |
| USA | US | 329.74 | 9833.52 | 19485.39 | N.America | 1776-07-04 |
| IDN | Indonesia | 268.07 | 1910.93 | 1015.54 | Asia | 1945-08-17 |
| BRA | Brazil | 210.32 | 8515.77 | 2055.51 | S.America | 1822-09-07 |
| PAK | Pakistan | 205.71 | 881.91 | 302.14 | Asia | 1947-08-14 |
| NGA | Nigeria | 200.96 | 923.77 | 375.77 | Africa | 1960-10-01 |
| BGD | Bangladesh | 167.09 | 147.57 | 245.63 | Asia | 1971-03-26 |
| RUS | Russia | 146.79 | 17098.25 | 1530.75 | | 1992-06-12 |
| MEX | Mexico | 126.58 | 1964.38 | 1158.23 | N.America | 1810-09-16 |
| JPN | Japan | 126.22 | 377.97 | 4872.42 | Asia | |
| DEU | Germany | 83.02 | 357.11 | 3693.20 | Europe | |
| FRA | France | 67.02 | 640.68 | 2582.49 | Europe | 1789-07-14 |
| GBR | UK | 66.44 | 242.50 | 2631.23 | Europe | |
| ITA | Italy | 60.36 | 301.34 | 1943.84 | Europe | |
| ARG | Argentina | 44.94 | 2780.40 | 637.49 | S.America | 1816-07-09 |
| DZA | Algeria | 43.38 | 2381.74 | 167.56 | Africa | 1962-07-05 |
| CAN | Canada | 37.59 | 9984.67 | 1647.12 | N.America | 1867-07-01 |
| AUS | Australia | 25.47 | 7692.02 | 1408.68 | Oceania | |
| KAZ | Kazakhstan | 18.53 | 2724.90 | 159.41 | Asia | 1991-12-16 |
You may notice that some of the data is missing. For example, the continent for Russia is not specified because it spreads across both Europe and Asia. There are also several missing independence days because the data source omits them.
You can organize this data in Python using a nested dictionary:
data = {
    'CHN': {'COUNTRY': 'China', 'POP': 1_398.72, 'AREA': 9_596.96,
            'GDP': 12_234.78, 'CONT': 'Asia'},
    'IND': {'COUNTRY': 'India', 'POP': 1_351.16, 'AREA': 3_287.26,
            'GDP': 2_575.67, 'CONT': 'Asia', 'IND_DAY': '1947-08-15'},
    'USA': {'COUNTRY': 'US', 'POP': 329.74, 'AREA': 9_833.52,
            'GDP': 19_485.39, 'CONT': 'N.America', 'IND_DAY': '1776-07-04'},
    'IDN': {'COUNTRY': 'Indonesia', 'POP': 268.07, 'AREA': 1_910.93,
            'GDP': 1_015.54, 'CONT': 'Asia', 'IND_DAY': '1945-08-17'},
    'BRA': {'COUNTRY': 'Brazil', 'POP': 210.32, 'AREA': 8_515.77,
            'GDP': 2_055.51, 'CONT': 'S.America', 'IND_DAY': '1822-09-07'},
    'PAK': {'COUNTRY': 'Pakistan', 'POP': 205.71, 'AREA': 881.91,
            'GDP': 302.14, 'CONT': 'Asia', 'IND_DAY': '1947-08-14'},
    'NGA': {'COUNTRY': 'Nigeria', 'POP': 200.96, 'AREA': 923.77,
            'GDP': 375.77, 'CONT': 'Africa', 'IND_DAY': '1960-10-01'},
    'BGD': {'COUNTRY': 'Bangladesh', 'POP': 167.09, 'AREA': 147.57,
            'GDP': 245.63, 'CONT': 'Asia', 'IND_DAY': '1971-03-26'},
    'RUS': {'COUNTRY': 'Russia', 'POP': 146.79, 'AREA': 17_098.25,
            'GDP': 1_530.75, 'IND_DAY': '1992-06-12'},
    'MEX': {'COUNTRY': 'Mexico', 'POP': 126.58, 'AREA': 1_964.38,
            'GDP': 1_158.23, 'CONT': 'N.America', 'IND_DAY': '1810-09-16'},
    'JPN': {'COUNTRY': 'Japan', 'POP': 126.22, 'AREA': 377.97,
            'GDP': 4_872.42, 'CONT': 'Asia'},
    'DEU': {'COUNTRY': 'Germany', 'POP': 83.02, 'AREA': 357.11,
            'GDP': 3_693.20, 'CONT': 'Europe'},
    'FRA': {'COUNTRY': 'France', 'POP': 67.02, 'AREA': 640.68,
            'GDP': 2_582.49, 'CONT': 'Europe', 'IND_DAY': '1789-07-14'},
    'GBR': {'COUNTRY': 'UK', 'POP': 66.44, 'AREA': 242.50,
            'GDP': 2_631.23, 'CONT': 'Europe'},
    'ITA': {'COUNTRY': 'Italy', 'POP': 60.36, 'AREA': 301.34,
            'GDP': 1_943.84, 'CONT': 'Europe'},
    'ARG': {'COUNTRY': 'Argentina', 'POP': 44.94, 'AREA': 2_780.40,
            'GDP': 637.49, 'CONT': 'S.America', 'IND_DAY': '1816-07-09'},
    'DZA': {'COUNTRY': 'Algeria', 'POP': 43.38, 'AREA': 2_381.74,
            'GDP': 167.56, 'CONT': 'Africa', 'IND_DAY': '1962-07-05'},
    'CAN': {'COUNTRY': 'Canada', 'POP': 37.59, 'AREA': 9_984.67,
            'GDP': 1_647.12, 'CONT': 'N.America', 'IND_DAY': '1867-07-01'},
    'AUS': {'COUNTRY': 'Australia', 'POP': 25.47, 'AREA': 7_692.02,
            'GDP': 1_408.68, 'CONT': 'Oceania'},
    'KAZ': {'COUNTRY': 'Kazakhstan', 'POP': 18.53, 'AREA': 2_724.90,
            'GDP': 159.41, 'CONT': 'Asia', 'IND_DAY': '1991-12-16'},
}

columns = ('COUNTRY', 'POP', 'AREA', 'GDP', 'CONT', 'IND_DAY')
Each row of the table is written as an inner dictionary whose keys are the column names and values are the corresponding data. These dictionaries are then collected as the values in the outer data
dictionary. The corresponding keys for data
are the three-letter country codes.
You can use this data
to create an instance of a Pandas DataFrame
. First, you need to import Pandas:
>>> import pandas as pd
Now that you have Pandas imported, you can utilize the DataFrame
constructor and data
to create a DataFrame
object.
data
is organized in such a way that the country codes correspond to columns. You tin can reverse the rows and columns of a DataFrame
with the property .T
:
>>> df = pd.DataFrame(data=data).T
>>> df
        COUNTRY      POP     AREA      GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.8       Asia         NaN
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.4  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77  2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77   375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.2  1530.75        NaN  1992-06-12
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
JPN       Japan   126.22   377.97  4872.42       Asia         NaN
DEU     Germany    83.02   357.11   3693.2     Europe         NaN
FRA      France    67.02   640.68  2582.49     Europe  1789-07-14
GBR          UK    66.44    242.5  2631.23     Europe         NaN
ITA       Italy    60.36   301.34  1943.84     Europe         NaN
ARG   Argentina    44.94   2780.4   637.49  S.America  1816-07-09
DZA     Algeria    43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia    25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.9   159.41       Asia  1991-12-16
Now you have your DataFrame
object populated with the data about each country.
Versions of Python older than 3.6 did not guarantee the order of keys in dictionaries. To ensure the order of columns is maintained for older versions of Python and Pandas, you can specify index=columns
:
>>> df = pd.DataFrame(data=data, index=columns).T
Now that you've prepared your data, you're ready to start working with files!
Using the Pandas read_csv() and .to_csv() Functions
A comma-separated values (CSV) file is a plaintext file with a .csv
extension that holds tabular data. This is one of the most popular file formats for storing large amounts of data. Each row of the CSV file represents a single table row. The values in the same row are by default separated with commas, but you could change the separator to a semicolon, tab, space, or some other character.
Write a CSV File
You can save your Pandas DataFrame
as a CSV file with .to_csv()
:
>>> df.to_csv('data.csv')
That's it! You've created the file data.csv
in your current working directory. You can expand the code block below to see how your CSV file should look:
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
This text file contains the data separated with commas. The first column contains the row labels. In some cases, you'll find them irrelevant. If you don't want to keep them, then you can pass the argument index=False
to .to_csv()
.
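For example, here's a quick sketch of writing without the row labels (the file name data-no-index.csv is just illustrative):

>>> df.to_csv('data-no-index.csv', index=False)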
Read a CSV File
Once your data is saved in a CSV file, you'll likely want to load and use it from time to time. You can do that with the Pandas read_csv()
function:
>>> df = pd.read_csv('data.csv', index_col=0)
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
In this example, the Pandas read_csv()
function returns a new DataFrame
with the data and labels from the file data.csv
, which you specified with the first argument. This string can be any valid path, including URLs.
The parameter index_col
specifies the column from the CSV file that contains the row labels. You assign a zero-based column index to this parameter. You should determine the value of index_col
when the CSV file contains the row labels to avoid loading them as data.
You'll learn more about using Pandas with CSV files later in this tutorial. You can also check out Reading and Writing CSV Files in Python to see how to handle CSV files with the built-in Python library csv as well.
Using Pandas to Write and Read Excel Files
Microsoft Excel is probably the most widely-used spreadsheet software. While older versions used binary .xls
files, Excel 2007 introduced the new XML-based .xlsx
file. You can read and write Excel files in Pandas, similar to CSV files. However, you'll need to install the following Python packages first:
- xlwt to write to .xls files
- openpyxl or XlsxWriter to write to .xlsx files
- xlrd to read Excel files
You can install them using pip with a single command:
$ pip install xlwt openpyxl xlsxwriter xlrd
You can also use Conda:
$ conda install xlwt openpyxl xlsxwriter xlrd
Please note that you don't have to install all these packages. For example, you don't need both openpyxl and XlsxWriter. If you're going to work only with .xls
files, then you don't need any of them! However, if you intend to work only with .xlsx
files, then you're going to need at least one of them, but not xlwt
. Take some time to decide which packages are right for your project.
Write an Excel File
Once you have those packages installed, you can save your DataFrame
in an Excel file with .to_excel()
:
>>> df.to_excel('data.xlsx')
The argument 'data.xlsx'
represents the target file and, optionally, its path. The above statement should create the file data.xlsx
in your current working directory. That file should look like this:
The first column of the file contains the labels of the rows, while the other columns store data.
Read an Excel File
You can load data from Excel files with read_excel()
:
>>> df = pd.read_excel('data.xlsx', index_col=0)
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
read_excel()
returns a new DataFrame
that contains the values from data.xlsx
. You can also use read_excel()
with OpenDocument spreadsheets, or .ods
files.
You'll learn more about working with Excel files later on in this tutorial. You can also check out Using Pandas to Read Large Excel Files in Python.
Understanding the Pandas IO API
Pandas IO Tools is the API that allows you to save the contents of Series
and DataFrame
objects to the clipboard, objects, or files of various types. It also enables loading data from the clipboard, objects, or files.
Write Files
Series
and DataFrame
objects have methods that enable writing data and labels to the clipboard or files. They're named with the pattern .to_<file-type>()
, where <file-type>
is the type of the target file.
You've learned about .to_csv()
and .to_excel()
, but there are others, including:
- .to_json()
- .to_html()
- .to_sql()
- .to_pickle()
There are still more file types that you can write to, so this list is not exhaustive.
These methods have parameters specifying the target file path where you want to save the data and labels. This is mandatory in some cases and optional in others. If this option is available and you choose to omit it, then the methods return the objects (like strings or iterables) with the contents of DataFrame
instances.
The optional parameter compression
decides how to compress the file with the data and labels. You'll learn more about it later. There are a few other parameters, but they're mostly specific to one or several methods. You won't go into them in detail here.
Read Files
Pandas functions for reading the contents of files are named using the pattern read_<file-type>()
, where <file-type>
indicates the type of the file to read. You've already seen the Pandas read_csv()
and read_excel()
functions. Here are a few others:
- read_json()
- read_html()
- read_sql()
- read_pickle()
These functions have a parameter that specifies the target file path. It can be any valid string that represents the path, either on a local machine or in a URL. Other objects are also acceptable depending on the file type.
The optional parameter compression
determines the type of decompression to use for the compressed files. You'll learn about it later in this tutorial. There are other parameters, but they're specific to one or several functions. You won't go into them in detail here.
Working With Different File Types
The Pandas library offers a broad range of possibilities for saving your data to files and loading data from files. In this section, you'll learn more about working with CSV and Excel files. You'll also see how to use other types of files, like JSON, web pages, databases, and Python pickle files.
CSV Files
You've already learned how to read and write CSV files. Now let's dig a little deeper into the details. When you use .to_csv()
to save your DataFrame
, you can provide an argument for the parameter path_or_buf
to specify the path, name, and extension of the target file.
path_or_buf
is the first argument .to_csv()
will get. It can be any string that represents a valid file path that includes the file name and its extension. You've seen this in a previous example. However, if you omit path_or_buf
, then .to_csv()
won't create any files. Instead, it'll return the corresponding string:
>>> df = pd.DataFrame(data=data).T
>>> s = df.to_csv()
>>> print(s)
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
Now you have the string s
instead of a CSV file. You also have some missing values in your DataFrame
object. For example, the continent for Russia and the independence days for several countries (China, Japan, and so on) are not available. In data science and machine learning, you must handle missing values carefully. Pandas excels here! By default, Pandas uses the NaN value to replace the missing values.
The continent that corresponds to Russia in df
is nan
:
>>> df.loc['RUS', 'CONT']
nan
This example uses .loc[]
to get data with the specified row and column names.
When you save your DataFrame
to a CSV file, empty strings (''
) will represent the missing data. You can see this both in your file data.csv
and in the string s
. If you want to change this behavior, then use the optional parameter na_rep
:
>>> df.to_csv('new-data.csv', na_rep='(missing)')
This code produces the file new-data.csv
where the missing values are no longer empty strings. You can expand the code block below to see how this file should look:
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,(missing)
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,(missing),1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,(missing)
DEU,Germany,83.02,357.11,3693.2,Europe,(missing)
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,(missing)
ITA,Italy,60.36,301.34,1943.84,Europe,(missing)
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,(missing)
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
Now, the string '(missing)'
in the file corresponds to the nan
values from df
.
When Pandas reads files, it considers the empty string (''
) and a few others as missing values by default:
- 'nan'
- '-nan'
- 'NA'
- 'N/A'
- 'NaN'
- 'null'
If you don't want this behavior, then you can pass keep_default_na=False
to the Pandas read_csv()
function. To specify other labels for missing values, use the parameter na_values
:
>>> pd.read_csv('new-data.csv', index_col=0, na_values='(missing)')
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaN
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaN
DEU     Germany    83.02    357.11   3693.20     Europe         NaN
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaN
ITA       Italy    60.36    301.34   1943.84     Europe         NaN
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaN
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
Here, you've marked the string '(missing)'
as a new missing data label, and Pandas replaced it with nan
when it read the file.
When you load data from a file, Pandas assigns the data types to the values of each column by default. You can check these types with .dtypes
:
>>> df = pd.read_csv('data.csv', index_col=0)
>>> df.dtypes
COUNTRY     object
POP        float64
AREA       float64
GDP        float64
CONT        object
IND_DAY     object
dtype: object
The columns with strings and dates ('COUNTRY'
, 'CONT'
, and 'IND_DAY'
) have the data type object
. Meanwhile, the numeric columns contain 64-bit floating-point numbers (float64
).
You can use the parameter dtype
to specify the desired data types and parse_dates
to force use of datetimes:
>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,
...                  parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object
>>> df['IND_DAY']
CHN          NaT
IND   1947-08-15
USA   1776-07-04
IDN   1945-08-17
BRA   1822-09-07
PAK   1947-08-14
NGA   1960-10-01
BGD   1971-03-26
RUS   1992-06-12
MEX   1810-09-16
JPN          NaT
DEU          NaT
FRA   1789-07-14
GBR          NaT
ITA          NaT
ARG   1816-07-09
DZA   1962-07-05
CAN   1867-07-01
AUS          NaT
KAZ   1991-12-16
Name: IND_DAY, dtype: datetime64[ns]
Now, you have 32-bit floating-point numbers (float32
) as specified with dtype
. These differ slightly from the original 64-bit numbers because of smaller precision. The values in the last column are considered as dates and have the data type datetime64
. That's why the NaN
values in this column are replaced with NaT
.
Now that you have real dates, you can save them in the format you like:
>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.to_csv('formatted-data.csv', date_format='%B %d, %Y')
Here, you've specified the parameter date_format
to be '%B %d, %Y'
. You can expand the code block below to see the resulting file:
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,"August 15, 1947"
USA,US,329.74,9833.52,19485.39,N.America,"July 04, 1776"
IDN,Indonesia,268.07,1910.93,1015.54,Asia,"August 17, 1945"
BRA,Brazil,210.32,8515.77,2055.51,S.America,"September 07, 1822"
PAK,Pakistan,205.71,881.91,302.14,Asia,"August 14, 1947"
NGA,Nigeria,200.96,923.77,375.77,Africa,"October 01, 1960"
BGD,Bangladesh,167.09,147.57,245.63,Asia,"March 26, 1971"
RUS,Russia,146.79,17098.25,1530.75,,"June 12, 1992"
MEX,Mexico,126.58,1964.38,1158.23,N.America,"September 16, 1810"
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,"July 14, 1789"
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,"July 09, 1816"
DZA,Algeria,43.38,2381.74,167.56,Africa,"July 05, 1962"
CAN,Canada,37.59,9984.67,1647.12,N.America,"July 01, 1867"
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,"December 16, 1991"
The format of the dates is different now. The format '%B %d, %Y'
means the date will first display the full name of the month, then the day followed by a comma, and finally the full year.
There are several other optional parameters that you can use with .to_csv()
:
- sep denotes a values separator.
- decimal indicates a decimal separator.
- encoding sets the file encoding.
- header specifies whether you want to write column labels in the file.
Here's how you would pass arguments for sep
and header
:
>>> s = df.to_csv(sep=';', header=False)
>>> print(s)
CHN;China;1398.72;9596.96;12234.78;Asia;
IND;India;1351.16;3287.26;2575.67;Asia;1947-08-15
USA;US;329.74;9833.52;19485.39;N.America;1776-07-04
IDN;Indonesia;268.07;1910.93;1015.54;Asia;1945-08-17
BRA;Brazil;210.32;8515.77;2055.51;S.America;1822-09-07
PAK;Pakistan;205.71;881.91;302.14;Asia;1947-08-14
NGA;Nigeria;200.96;923.77;375.77;Africa;1960-10-01
BGD;Bangladesh;167.09;147.57;245.63;Asia;1971-03-26
RUS;Russia;146.79;17098.25;1530.75;;1992-06-12
MEX;Mexico;126.58;1964.38;1158.23;N.America;1810-09-16
JPN;Japan;126.22;377.97;4872.42;Asia;
DEU;Germany;83.02;357.11;3693.2;Europe;
FRA;France;67.02;640.68;2582.49;Europe;1789-07-14
GBR;UK;66.44;242.5;2631.23;Europe;
ITA;Italy;60.36;301.34;1943.84;Europe;
ARG;Argentina;44.94;2780.4;637.49;S.America;1816-07-09
DZA;Algeria;43.38;2381.74;167.56;Africa;1962-07-05
CAN;Canada;37.59;9984.67;1647.12;N.America;1867-07-01
AUS;Australia;25.47;7692.02;1408.68;Oceania;
KAZ;Kazakhstan;18.53;2724.9;159.41;Asia;1991-12-16
The data is separated with a semicolon (';'
) because you've specified sep=';'
. Also, since you passed header=False
, you see your data without the header row of column names.
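The decimal parameter works the same way. As a quick sketch (the file name is illustrative), writing with decimal=',' uses a comma as the decimal separator, which pairs naturally with a semicolon value separator:

>>> df.to_csv('data-decimal.csv', sep=';', decimal=',', encoding='utf-8')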
The Pandas read_csv()
function has many additional options for managing missing data, working with dates and times, quoting, encoding, handling errors, and more. For instance, if you have a file with one data column and want to get a Series
object instead of a DataFrame
, then you can pass squeeze=True
to read_csv()
. You'll learn later about data compression and decompression, as well as how to skip rows and columns.
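Here's a quick sketch of that option, assuming a hypothetical single-column file pop.csv built from the POP column:

>>> df['POP'].to_csv('pop.csv', header=True)
>>> pop = pd.read_csv('pop.csv', index_col=0, squeeze=True)
>>> type(pop)
<class 'pandas.core.series.Series'>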
JSON Files
JSON stands for JavaScript object notation. JSON files are plaintext files used for data interchange, and humans can read them easily. They follow the ISO/IEC 21778:2017 and ECMA-404 standards and use the .json
extension. Python and Pandas work well with JSON files, as Python's json library offers built-in support for them.
You can save the data from your DataFrame
to a JSON file with .to_json()
. Start by creating a DataFrame
object again. Use the dictionary data
that holds the data about countries and then apply .to_json()
:
>>> df = pd.DataFrame(data=data).T
>>> df.to_json('data-columns.json')
This code produces the file data-columns.json
. You can expand the code block below to see how this file should look:
{ "Land" :{ "CHN" : "Red china" , "IND" : "Bharat" , "USA" : "Usa" , "IDN" : "Indonesia" , "BRA" : "Brazil" , "PAK" : "Pakistan" , "NGA" : "Nigeria" , "BGD" : "Bangladesh" , "RUS" : "Russia" , "MEX" : "Mexico" , "JPN" : "Japan" , "DEU" : "Federal republic of germany" , "FRA" : "France" , "GBR" : "UK" , "ITA" : "Italian republic" , "ARG" : "Argentina" , "DZA" : "Algeria" , "Tin can" : "Canada" , "AUS" : "Australia" , "KAZ" : "Republic of kazakhstan" }, "POP" :{ "CHN" : 1398.72 , "IND" : 1351.16 , "USA" : 329.74 , "IDN" : 268.07 , "BRA" : 210.32 , "PAK" : 205.71 , "NGA" : 200.96 , "BGD" : 167.09 , "RUS" : 146.79 , "MEX" : 126.58 , "JPN" : 126.22 , "DEU" : 83.02 , "FRA" : 67.02 , "GBR" : 66.44 , "ITA" : 60.36 , "ARG" : 44.94 , "DZA" : 43.38 , "CAN" : 37.59 , "AUS" : 25.47 , "KAZ" : xviii.53 }, "AREA" :{ "CHN" : 9596.96 , "IND" : 3287.26 , "USA" : 9833.52 , "IDN" : 1910.93 , "BRA" : 8515.77 , "PAK" : 881.91 , "NGA" : 923.77 , "BGD" : 147.57 , "RUS" : 17098.25 , "MEX" : 1964.38 , "JPN" : 377.97 , "DEU" : 357.eleven , "FRA" : 640.68 , "GBR" : 242.5 , "ITA" : 301.34 , "ARG" : 2780.iv , "DZA" : 2381.74 , "Tin" : 9984.67 , "AUS" : 7692.02 , "KAZ" : 2724.ix }, "GDP" :{ "CHN" : 12234.78 , "IND" : 2575.67 , "USA" : 19485.39 , "IDN" : 1015.54 , "BRA" : 2055.51 , "PAK" : 302.14 , "NGA" : 375.77 , "BGD" : 245.63 , "RUS" : 1530.75 , "MEX" : 1158.23 , "JPN" : 4872.42 , "DEU" : 3693.2 , "FRA" : 2582.49 , "GBR" : 2631.23 , "ITA" : 1943.84 , "ARG" : 637.49 , "DZA" : 167.56 , "CAN" : 1647.12 , "AUS" : 1408.68 , "KAZ" : 159.41 }, "CONT" :{ "CHN" : "Asia" , "IND" : "Asia" , "United states" : "N.America" , "IDN" : "Asia" , "BRA" : "S.America" , "PAK" : "Asia" , "NGA" : "Africa" , "BGD" : "Asia" , "RUS" : null , "MEX" : "N.America" , "JPN" : "Asia" , "DEU" : "Europe" , "FRA" : "Europe" , "GBR" : "Europe" , "ITA" : "Europe" , "ARG" : "S.America" , "DZA" : "Africa" , "CAN" : "N.America" , "AUS" : "Oceania" , "KAZ" : "Asia" }, "IND_DAY" :{ "CHN" : zip , "IND" : "1947-08-15" , "USA" : "1776-07-04" , "IDN" : "1945-08-17" , "BRA" : "1822-09-07" , "PAK" : "1947-08-fourteen" , "NGA" : "1960-x-01" , "BGD" : "1971-03-26" , "RUS" : "1992-06-12" , "MEX" : "1810-09-16" , "JPN" : null , "DEU" : null , "FRA" : "1789-07-14" , "GBR" : cypher , "ITA" : null , "ARG" : "1816-07-09" , "DZA" : "1962-07-05" , "CAN" : "1867-07-01" , "AUS" : cypher , "KAZ" : "1991-12-16" }}
data-columns.json
has one large dictionary with the column labels as keys and the respective inner dictionaries as values.
You can get a different file structure if you pass an argument for the optional parameter orient
:
>>> df.to_json('data-index.json', orient='index')
The orient
parameter defaults to 'columns'
. Here, you've set it to 'index'.
You should get a new file data-index.json
. You can expand the code block below to see the changes:
{ "CHN" :{ "COUNTRY" : "China" , "POP" : 1398.72 , "Surface area" : 9596.96 , "GDP" : 12234.78 , "CONT" : "Asia" , "IND_DAY" : nada }, "IND" :{ "COUNTRY" : "Bharat" , "POP" : 1351.xvi , "Surface area" : 3287.26 , "GDP" : 2575.67 , "CONT" : "Asia" , "IND_DAY" : "1947-08-15" }, "USA" :{ "COUNTRY" : "US" , "POP" : 329.74 , "AREA" : 9833.52 , "GDP" : 19485.39 , "CONT" : "N.America" , "IND_DAY" : "1776-07-04" }, "IDN" :{ "Country" : "Indonesia" , "POP" : 268.07 , "Expanse" : 1910.93 , "Gdp" : 1015.54 , "CONT" : "Asia" , "IND_DAY" : "1945-08-17" }, "BRA" :{ "COUNTRY" : "Brazil" , "Popular" : 210.32 , "Area" : 8515.77 , "GDP" : 2055.51 , "CONT" : "Due south.America" , "IND_DAY" : "1822-09-07" }, "PAK" :{ "State" : "Pakistan" , "POP" : 205.71 , "Area" : 881.91 , "GDP" : 302.14 , "CONT" : "Asia" , "IND_DAY" : "1947-08-xiv" }, "NGA" :{ "COUNTRY" : "Nigeria" , "POP" : 200.96 , "AREA" : 923.77 , "Gdp" : 375.77 , "CONT" : "Africa" , "IND_DAY" : "1960-x-01" }, "BGD" :{ "Land" : "People's republic of bangladesh" , "Pop" : 167.09 , "AREA" : 147.57 , "Gross domestic product" : 245.63 , "CONT" : "Asia" , "IND_DAY" : "1971-03-26" }, "RUS" :{ "Land" : "Russia" , "POP" : 146.79 , "Expanse" : 17098.25 , "Gross domestic product" : 1530.75 , "CONT" : zero , "IND_DAY" : "1992-06-12" }, "MEX" :{ "COUNTRY" : "Mexico" , "POP" : 126.58 , "AREA" : 1964.38 , "Gdp" : 1158.23 , "CONT" : "N.America" , "IND_DAY" : "1810-09-16" }, "JPN" :{ "State" : "Nihon" , "POP" : 126.22 , "AREA" : 377.97 , "Gdp" : 4872.42 , "CONT" : "Asia" , "IND_DAY" : null }, "DEU" :{ "COUNTRY" : "Germany" , "POP" : 83.02 , "AREA" : 357.11 , "Gross domestic product" : 3693.2 , "CONT" : "Europe" , "IND_DAY" : null }, "FRA" :{ "COUNTRY" : "French republic" , "Pop" : 67.02 , "Area" : 640.68 , "Gdp" : 2582.49 , "CONT" : "Europe" , "IND_DAY" : "1789-07-14" }, "GBR" :{ "COUNTRY" : "UK" , "Pop" : 66.44 , "AREA" : 242.v , "GDP" : 2631.23 , "CONT" : "Europe" , "IND_DAY" : null }, "ITA" :{ "COUNTRY" : "Italy" , "Popular" : threescore.36 , "Expanse" : 301.34 , "GDP" : 1943.84 , "CONT" : "Europe" , "IND_DAY" : goose egg }, "ARG" :{ "COUNTRY" : "Argentina" , "POP" : 44.94 , "AREA" : 2780.4 , "Gdp" : 637.49 , "CONT" : "S.America" , "IND_DAY" : "1816-07-09" }, "DZA" :{ "State" : "Algeria" , "POP" : 43.38 , "Expanse" : 2381.74 , "Gdp" : 167.56 , "CONT" : "Africa" , "IND_DAY" : "1962-07-05" }, "CAN" :{ "State" : "Canada" , "POP" : 37.59 , "AREA" : 9984.67 , "GDP" : 1647.12 , "CONT" : "N.America" , "IND_DAY" : "1867-07-01" }, "AUS" :{ "State" : "Australia" , "POP" : 25.47 , "Surface area" : 7692.02 , "GDP" : 1408.68 , "CONT" : "Oceania" , "IND_DAY" : null }, "KAZ" :{ "State" : "Kazakhstan" , "POP" : eighteen.53 , "Area" : 2724.9 , "Gross domestic product" : 159.41 , "CONT" : "Asia" , "IND_DAY" : "1991-12-sixteen" }}
data-index.json
also has one large dictionary, but this time the row labels are the keys, and the inner dictionaries are the values.
There are a few more options for orient
. One of them is 'records'
:
>>> df.to_json('data-records.json', orient='records')
This code should yield the file data-records.json
. You can expand the code block below to see the content:
[{ "COUNTRY" : "China" , "POP" : 1398.72 , "AREA" : 9596.96 , "GDP" : 12234.78 , "CONT" : "Asia" , "IND_DAY" : nix },{ "COUNTRY" : "India" , "Pop" : 1351.16 , "AREA" : 3287.26 , "Gdp" : 2575.67 , "CONT" : "Asia" , "IND_DAY" : "1947-08-fifteen" },{ "Country" : "US" , "Pop" : 329.74 , "AREA" : 9833.52 , "Gdp" : 19485.39 , "CONT" : "Due north.America" , "IND_DAY" : "1776-07-04" },{ "COUNTRY" : "Republic of indonesia" , "POP" : 268.07 , "AREA" : 1910.93 , "Gdp" : 1015.54 , "CONT" : "Asia" , "IND_DAY" : "1945-08-17" },{ "COUNTRY" : "Brazil" , "Popular" : 210.32 , "Area" : 8515.77 , "Gross domestic product" : 2055.51 , "CONT" : "S.America" , "IND_DAY" : "1822-09-07" },{ "COUNTRY" : "Pakistan" , "POP" : 205.71 , "AREA" : 881.91 , "Gross domestic product" : 302.14 , "CONT" : "Asia" , "IND_DAY" : "1947-08-fourteen" },{ "Land" : "Nigeria" , "POP" : 200.96 , "AREA" : 923.77 , "Gross domestic product" : 375.77 , "CONT" : "Africa" , "IND_DAY" : "1960-ten-01" },{ "Land" : "Bangladesh" , "POP" : 167.09 , "Surface area" : 147.57 , "GDP" : 245.63 , "CONT" : "Asia" , "IND_DAY" : "1971-03-26" },{ "COUNTRY" : "Russian federation" , "Pop" : 146.79 , "Area" : 17098.25 , "Gross domestic product" : 1530.75 , "CONT" : nix , "IND_DAY" : "1992-06-12" },{ "COUNTRY" : "Mexico" , "POP" : 126.58 , "Surface area" : 1964.38 , "Gross domestic product" : 1158.23 , "CONT" : "Northward.America" , "IND_DAY" : "1810-09-16" },{ "COUNTRY" : "Japan" , "POP" : 126.22 , "AREA" : 377.97 , "GDP" : 4872.42 , "CONT" : "Asia" , "IND_DAY" : null },{ "Land" : "Frg" , "POP" : 83.02 , "Surface area" : 357.11 , "GDP" : 3693.2 , "CONT" : "Europe" , "IND_DAY" : null },{ "COUNTRY" : "France" , "POP" : 67.02 , "Area" : 640.68 , "GDP" : 2582.49 , "CONT" : "Europe" , "IND_DAY" : "1789-07-14" },{ "COUNTRY" : "Britain" , "POP" : 66.44 , "AREA" : 242.five , "GDP" : 2631.23 , "CONT" : "Europe" , "IND_DAY" : goose egg },{ "COUNTRY" : "Italy" , "Popular" : 60.36 , "AREA" : 301.34 , "Gross domestic product" : 1943.84 , "CONT" : "Europe" , "IND_DAY" : null },{ "State" : "Argentina" , "Popular" : 44.94 , "Surface area" : 2780.4 , "Gross domestic product" : 637.49 , "CONT" : "South.America" , "IND_DAY" : "1816-07-09" },{ "Country" : "Algeria" , "POP" : 43.38 , "Expanse" : 2381.74 , "GDP" : 167.56 , "CONT" : "Africa" , "IND_DAY" : "1962-07-05" },{ "COUNTRY" : "Canada" , "POP" : 37.59 , "AREA" : 9984.67 , "Gdp" : 1647.12 , "CONT" : "N.America" , "IND_DAY" : "1867-07-01" },{ "Land" : "Australia" , "Popular" : 25.47 , "AREA" : 7692.02 , "Gross domestic product" : 1408.68 , "CONT" : "Oceania" , "IND_DAY" : zippo },{ "Land" : "Kazakhstan" , "Pop" : xviii.53 , "AREA" : 2724.9 , "Gdp" : 159.41 , "CONT" : "Asia" , "IND_DAY" : "1991-12-16" }]
data-records.json
holds a list with one dictionary for each row. The row labels are not written.
You can get another interesting file structure with orient='split'
:
>>> df.to_json('data-split.json', orient='split')
The resulting file is data-split.json
. You can expand the code block below to see how this file should look:
{ "columns" :[ "Land" , "Popular" , "AREA" , "GDP" , "CONT" , "IND_DAY" ], "alphabetize" :[ "CHN" , "IND" , "Us" , "IDN" , "BRA" , "PAK" , "NGA" , "BGD" , "RUS" , "MEX" , "JPN" , "DEU" , "FRA" , "GBR" , "ITA" , "ARG" , "DZA" , "CAN" , "AUS" , "KAZ" ], "data" :[[ "People's republic of china" , 1398.72 , 9596.96 , 12234.78 , "Asia" , null ],[ "Bharat" , 1351.16 , 3287.26 , 2575.67 , "Asia" , "1947-08-15" ],[ "United states of america" , 329.74 , 9833.52 , 19485.39 , "N.America" , "1776-07-04" ],[ "Republic of indonesia" , 268.07 , 1910.93 , 1015.54 , "Asia" , "1945-08-17" ],[ "Brazil" , 210.32 , 8515.77 , 2055.51 , "S.America" , "1822-09-07" ],[ "Pakistan" , 205.71 , 881.91 , 302.xiv , "Asia" , "1947-08-xiv" ],[ "Nigeria" , 200.96 , 923.77 , 375.77 , "Africa" , "1960-ten-01" ],[ "Bangladesh" , 167.09 , 147.57 , 245.63 , "Asia" , "1971-03-26" ],[ "Russian federation" , 146.79 , 17098.25 , 1530.75 , null , "1992-06-12" ],[ "Mexico" , 126.58 , 1964.38 , 1158.23 , "N.America" , "1810-09-16" ],[ "Japan" , 126.22 , 377.97 , 4872.42 , "Asia" , nada ],[ "Germany" , 83.02 , 357.eleven , 3693.two , "Europe" , null ],[ "France" , 67.02 , 640.68 , 2582.49 , "Europe" , "1789-07-14" ],[ "UK" , 66.44 , 242.five , 2631.23 , "Europe" , null ],[ "Italia" , 60.36 , 301.34 , 1943.84 , "Europe" , zip ],[ "Argentina" , 44.94 , 2780.iv , 637.49 , "S.America" , "1816-07-09" ],[ "People's democratic republic of algeria" , 43.38 , 2381.74 , 167.56 , "Africa" , "1962-07-05" ],[ "Canada" , 37.59 , 9984.67 , 1647.12 , "Due north.America" , "1867-07-01" ],[ "Australia" , 25.47 , 7692.02 , 1408.68 , "Oceania" , null ],[ "Kazakhstan" , eighteen.53 , 2724.nine , 159.41 , "Asia" , "1991-12-16" ]]}
data-split.json
contains one dictionary that holds the following lists:
- The names of the columns
- The labels of the rows
- The inner lists (two-dimensional sequence) that hold the data values
If you don't provide the value for the optional parameter path_or_buf
that defines the file path, then .to_json()
will return a JSON string instead of writing the results to a file. This behavior is consistent with .to_csv()
.
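For example, a quick sketch of that behavior:

>>> s = df.to_json()
>>> type(s)
<class 'str'>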
There are other optional parameters you can use. For instance, you can set index=False
to forgo saving row labels. You can manipulate precision with double_precision
, and dates with date_format
and date_unit
. These last two parameters are particularly important when you have time series among your data:
>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP                object
AREA               object
GDP                object
CONT               object
IND_DAY    datetime64[ns]
dtype: object
>>> df.to_json('data-time.json')
In this example, you've created the DataFrame
from the dictionary data
and used to_datetime()
to convert the values in the last column to datetime64
. You can expand the code block below to see the resulting file:
{ "Land" :{ "CHN" : "China" , "IND" : "Bharat" , "Us" : "United states" , "IDN" : "Indonesia" , "BRA" : "Brazil" , "PAK" : "Islamic republic of pakistan" , "NGA" : "Nigeria" , "BGD" : "Bangladesh" , "RUS" : "Russian federation" , "MEX" : "Mexico" , "JPN" : "Japan" , "DEU" : "Germany" , "FRA" : "France" , "GBR" : "Uk" , "ITA" : "Italy" , "ARG" : "Argentina" , "DZA" : "People's democratic republic of algeria" , "Can" : "Canada" , "AUS" : "Commonwealth of australia" , "KAZ" : "Kazakhstan" }, "Popular" :{ "CHN" : 1398.72 , "IND" : 1351.16 , "USA" : 329.74 , "IDN" : 268.07 , "BRA" : 210.32 , "PAK" : 205.71 , "NGA" : 200.96 , "BGD" : 167.09 , "RUS" : 146.79 , "MEX" : 126.58 , "JPN" : 126.22 , "DEU" : 83.02 , "FRA" : 67.02 , "GBR" : 66.44 , "ITA" : 60.36 , "ARG" : 44.94 , "DZA" : 43.38 , "Tin can" : 37.59 , "AUS" : 25.47 , "KAZ" : eighteen.53 }, "Area" :{ "CHN" : 9596.96 , "IND" : 3287.26 , "USA" : 9833.52 , "IDN" : 1910.93 , "BRA" : 8515.77 , "PAK" : 881.91 , "NGA" : 923.77 , "BGD" : 147.57 , "RUS" : 17098.25 , "MEX" : 1964.38 , "JPN" : 377.97 , "DEU" : 357.11 , "FRA" : 640.68 , "GBR" : 242.5 , "ITA" : 301.34 , "ARG" : 2780.iv , "DZA" : 2381.74 , "CAN" : 9984.67 , "AUS" : 7692.02 , "KAZ" : 2724.9 }, "GDP" :{ "CHN" : 12234.78 , "IND" : 2575.67 , "United states of america" : 19485.39 , "IDN" : 1015.54 , "BRA" : 2055.51 , "PAK" : 302.14 , "NGA" : 375.77 , "BGD" : 245.63 , "RUS" : 1530.75 , "MEX" : 1158.23 , "JPN" : 4872.42 , "DEU" : 3693.2 , "FRA" : 2582.49 , "GBR" : 2631.23 , "ITA" : 1943.84 , "ARG" : 637.49 , "DZA" : 167.56 , "Can" : 1647.12 , "AUS" : 1408.68 , "KAZ" : 159.41 }, "CONT" :{ "CHN" : "Asia" , "IND" : "Asia" , "USA" : "N.America" , "IDN" : "Asia" , "BRA" : "S.America" , "PAK" : "Asia" , "NGA" : "Africa" , "BGD" : "Asia" , "RUS" : null , "MEX" : "North.America" , "JPN" : "Asia" , "DEU" : "Europe" , "FRA" : "Europe" , "GBR" : "Europe" , "ITA" : "Europe" , "ARG" : "S.America" , "DZA" : "Africa" , "Tin can" : "N.America" , "AUS" : "Oceania" , "KAZ" : "Asia" }, "IND_DAY" :{ "CHN" : null , "IND" : -706320000000 , "Us" : -6106060800000 , "IDN" : -769219200000 , "BRA" : -4648924800000 , "PAK" : -706406400000 , "NGA" : -291945600000 , "BGD" : 38793600000 , "RUS" : 708307200000 , "MEX" : -5026838400000 , "JPN" : null , "DEU" : null , "FRA" : -5694969600000 , "GBR" : null , "ITA" : cipher , "ARG" : -4843411200000 , "DZA" : -236476800000 , "CAN" : -3234729600000 , "AUS" : null , "KAZ" : 692841600000 }}
In this file, you have large integers instead of dates for the independence days. That's because the default value of the optional parameter date_format
is 'epoch'
whenever orient
isn't 'table'
. This default behavior expresses dates as an epoch in milliseconds relative to midnight on January 1, 1970.
However, if you pass date_format='iso'
, then you'll get the dates in the ISO 8601 format. In addition, date_unit
decides the units of time:
>>> df = pd.DataFrame(data=data).T
>>> df['IND_DAY'] = pd.to_datetime(df['IND_DAY'])
>>> df.to_json('new-data-time.json', date_format='iso', date_unit='s')
This code produces the following JSON file:
{ "State" :{ "CHN" : "China" , "IND" : "India" , "U.s.a." : "US" , "IDN" : "Indonesia" , "BRA" : "Brazil" , "PAK" : "Pakistan" , "NGA" : "Nigeria" , "BGD" : "Bangladesh" , "RUS" : "Russia" , "MEX" : "Mexico" , "JPN" : "Japan" , "DEU" : "Deutschland" , "FRA" : "French republic" , "GBR" : "Great britain" , "ITA" : "Italia" , "ARG" : "Argentina" , "DZA" : "Algeria" , "CAN" : "Canada" , "AUS" : "Commonwealth of australia" , "KAZ" : "Kazakhstan" }, "Pop" :{ "CHN" : 1398.72 , "IND" : 1351.16 , "Us" : 329.74 , "IDN" : 268.07 , "BRA" : 210.32 , "PAK" : 205.71 , "NGA" : 200.96 , "BGD" : 167.09 , "RUS" : 146.79 , "MEX" : 126.58 , "JPN" : 126.22 , "DEU" : 83.02 , "FRA" : 67.02 , "GBR" : 66.44 , "ITA" : 60.36 , "ARG" : 44.94 , "DZA" : 43.38 , "Tin can" : 37.59 , "AUS" : 25.47 , "KAZ" : eighteen.53 }, "Area" :{ "CHN" : 9596.96 , "IND" : 3287.26 , "USA" : 9833.52 , "IDN" : 1910.93 , "BRA" : 8515.77 , "PAK" : 881.91 , "NGA" : 923.77 , "BGD" : 147.57 , "RUS" : 17098.25 , "MEX" : 1964.38 , "JPN" : 377.97 , "DEU" : 357.11 , "FRA" : 640.68 , "GBR" : 242.5 , "ITA" : 301.34 , "ARG" : 2780.4 , "DZA" : 2381.74 , "CAN" : 9984.67 , "AUS" : 7692.02 , "KAZ" : 2724.9 }, "Gross domestic product" :{ "CHN" : 12234.78 , "IND" : 2575.67 , "United states" : 19485.39 , "IDN" : 1015.54 , "BRA" : 2055.51 , "PAK" : 302.14 , "NGA" : 375.77 , "BGD" : 245.63 , "RUS" : 1530.75 , "MEX" : 1158.23 , "JPN" : 4872.42 , "DEU" : 3693.2 , "FRA" : 2582.49 , "GBR" : 2631.23 , "ITA" : 1943.84 , "ARG" : 637.49 , "DZA" : 167.56 , "Tin" : 1647.12 , "AUS" : 1408.68 , "KAZ" : 159.41 }, "CONT" :{ "CHN" : "Asia" , "IND" : "Asia" , "The states" : "N.America" , "IDN" : "Asia" , "BRA" : "S.America" , "PAK" : "Asia" , "NGA" : "Africa" , "BGD" : "Asia" , "RUS" : nada , "MEX" : "N.America" , "JPN" : "Asia" , "DEU" : "Europe" , "FRA" : "Europe" , "GBR" : "Europe" , "ITA" : "Europe" , "ARG" : "S.America" , "DZA" : "Africa" , "CAN" : "N.America" , "AUS" : "Oceania" , "KAZ" : "Asia" }, "IND_DAY" :{ "CHN" : cypher , "IND" : "1947-08-15T00:00:00Z" , "United states" : "1776-07-04T00:00:00Z" , "IDN" : "1945-08-17T00:00:00Z" , "BRA" : "1822-09-07T00:00:00Z" , "PAK" : "1947-08-14T00:00:00Z" , "NGA" : "1960-x-01T00:00:00Z" , "BGD" : "1971-03-26T00:00:00Z" , "RUS" : "1992-06-12T00:00:00Z" , "MEX" : "1810-09-16T00:00:00Z" , "JPN" : nil , "DEU" : null , "FRA" : "1789-07-14T00:00:00Z" , "GBR" : nothing , "ITA" : null , "ARG" : "1816-07-09T00:00:00Z" , "DZA" : "1962-07-05T00:00:00Z" , "CAN" : "1867-07-01T00:00:00Z" , "AUS" : null , "KAZ" : "1991-12-16T00:00:00Z" }}
The dates in the resulting file are in the ISO 8601 format.
You can load the data from a JSON file with read_json()
:
>>> df = pd.read_json('data-index.json', orient='index',
...                   convert_dates=['IND_DAY'])
The parameter convert_dates
has a similar purpose as parse_dates
when you use it to read CSV files. The optional parameter orient
is very important because it specifies how Pandas understands the structure of the file.
There are other optional parameters you can use as well:
- Set the encoding with encoding.
- Manipulate dates with convert_dates and keep_default_dates.
- Impact precision with dtype and precise_float.
- Decode numeric data directly to NumPy arrays with numpy=True.
Note that you might lose the order of rows and columns when using the JSON format to store your data.
HTML Files
An HTML file is a plaintext file that uses hypertext markup language to help browsers render web pages. The extensions for HTML files are .html
and .htm
. You'll need to install an HTML parser library like lxml or html5lib to be able to work with HTML files:
$ pip install lxml html5lib
You can also use Conda to install the same packages:
$ conda install lxml html5lib
Once you have these libraries, you can save the contents of your DataFrame
as an HTML file with .to_html()
:
>>> df = pd.DataFrame(data=data).T
>>> df.to_html('data.html')
This code generates a file data.html
. You can expand the code block below to see how this file should look:
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>COUNTRY</th>
      <th>POP</th>
      <th>AREA</th>
      <th>GDP</th>
      <th>CONT</th>
      <th>IND_DAY</th>
    </tr>
  </thead>
  <tbody>
    <tr><th>CHN</th><td>China</td><td>1398.72</td><td>9596.96</td><td>12234.8</td><td>Asia</td><td>NaN</td></tr>
    <tr><th>IND</th><td>India</td><td>1351.16</td><td>3287.26</td><td>2575.67</td><td>Asia</td><td>1947-08-15</td></tr>
    <tr><th>USA</th><td>US</td><td>329.74</td><td>9833.52</td><td>19485.4</td><td>N.America</td><td>1776-07-04</td></tr>
    <tr><th>IDN</th><td>Indonesia</td><td>268.07</td><td>1910.93</td><td>1015.54</td><td>Asia</td><td>1945-08-17</td></tr>
    <tr><th>BRA</th><td>Brazil</td><td>210.32</td><td>8515.77</td><td>2055.51</td><td>S.America</td><td>1822-09-07</td></tr>
    <tr><th>PAK</th><td>Pakistan</td><td>205.71</td><td>881.91</td><td>302.14</td><td>Asia</td><td>1947-08-14</td></tr>
    <tr><th>NGA</th><td>Nigeria</td><td>200.96</td><td>923.77</td><td>375.77</td><td>Africa</td><td>1960-10-01</td></tr>
    <tr><th>BGD</th><td>Bangladesh</td><td>167.09</td><td>147.57</td><td>245.63</td><td>Asia</td><td>1971-03-26</td></tr>
    <tr><th>RUS</th><td>Russia</td><td>146.79</td><td>17098.2</td><td>1530.75</td><td>NaN</td><td>1992-06-12</td></tr>
    <tr><th>MEX</th><td>Mexico</td><td>126.58</td><td>1964.38</td><td>1158.23</td><td>N.America</td><td>1810-09-16</td></tr>
    <tr><th>JPN</th><td>Japan</td><td>126.22</td><td>377.97</td><td>4872.42</td><td>Asia</td><td>NaN</td></tr>
    <tr><th>DEU</th><td>Germany</td><td>83.02</td><td>357.11</td><td>3693.2</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>FRA</th><td>France</td><td>67.02</td><td>640.68</td><td>2582.49</td><td>Europe</td><td>1789-07-14</td></tr>
    <tr><th>GBR</th><td>UK</td><td>66.44</td><td>242.5</td><td>2631.23</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>ITA</th><td>Italy</td><td>60.36</td><td>301.34</td><td>1943.84</td><td>Europe</td><td>NaN</td></tr>
    <tr><th>ARG</th><td>Argentina</td><td>44.94</td><td>2780.4</td><td>637.49</td><td>S.America</td><td>1816-07-09</td></tr>
    <tr><th>DZA</th><td>Algeria</td><td>43.38</td><td>2381.74</td><td>167.56</td><td>Africa</td><td>1962-07-05</td></tr>
    <tr><th>CAN</th><td>Canada</td><td>37.59</td><td>9984.67</td><td>1647.12</td><td>N.America</td><td>1867-07-01</td></tr>
    <tr><th>AUS</th><td>Australia</td><td>25.47</td><td>7692.02</td><td>1408.68</td><td>Oceania</td><td>NaN</td></tr>
    <tr><th>KAZ</th><td>Kazakhstan</td><td>18.53</td><td>2724.9</td><td>159.41</td><td>Asia</td><td>1991-12-16</td></tr>
  </tbody>
</table>
This file shows the DataFrame
contents nicely. However, notice that you haven't obtained an entire web page. You've just output the data that corresponds to df
in the HTML format.
.to_html()
won't create a file if you don't provide the optional parameter buf
, which denotes the buffer to write to. If you leave this parameter out, then your code will return a string as it did with .to_csv()
and .to_json()
.
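Here's a quick sketch of that:

>>> html = df.to_html()
>>> html.startswith('<table')
True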
Here are some other optional parameters:
- header determines whether to save the column names.
- index determines whether to save the row labels.
- classes assigns cascading style sheet (CSS) classes.
- render_links specifies whether to convert URLs to HTML links.
- table_id assigns the CSS id to the table tag.
- escape decides whether to convert the characters <, >, and & to HTML-safe strings.
You utilise parameters like these to specify different aspects of the resulting files or strings.
You can create a DataFrame
object from a suitable HTML file using read_html()
, which will return a DataFrame
instance or a list of them:
>>> df = pd.read_html('data.html', index_col=0, parse_dates=['IND_DAY'])
This is very similar to what you did when reading CSV files. You also have parameters that help you work with dates, missing values, precision, encoding, HTML parsers, and more.
Excel Files
You've already learned how to read and write Excel files with Pandas. However, there are a few more options worth considering. For one, when you use .to_excel()
, you can specify the name of the target worksheet with the optional parameter sheet_name
:
>>> df = pd.DataFrame(data=data).T
>>> df.to_excel('data.xlsx', sheet_name='COUNTRIES')
Here, you create a file data.xlsx with a worksheet called COUNTRIES that stores the data. The string 'data.xlsx' is the argument for the parameter excel_writer that defines the name of the Excel file or its path.
The optional parameters startrow and startcol both default to 0 and indicate the upper left-most cell where the data should start being written:
>>> df.to_excel('data-shifted.xlsx', sheet_name='COUNTRIES',
...             startrow=2, startcol=4)
Here, you specify that the table should start in the third row and the fifth column. You also used zero-based indexing, so the third row is denoted by 2 and the fifth column by 4.
Now the resulting worksheet looks like this:
As you can see, the table starts in the third row of the worksheet and in the fifth column, E.
.read_excel() also has the optional parameter sheet_name that specifies which worksheets to read when loading data. It can take on one of the following values:
- The zero-based index of the worksheet
- The name of the worksheet
- The list of indices or names to read multiple sheets
- The value None to read all sheets
Here's how you would use this parameter in your code:
>>> df = pd.read_excel('data.xlsx', sheet_name=0, index_col=0,
...                    parse_dates=['IND_DAY'])
>>> df = pd.read_excel('data.xlsx', sheet_name='COUNTRIES', index_col=0,
...                    parse_dates=['IND_DAY'])
Both statements above create the same DataFrame because the sheet_name parameters have the same values. In both cases, sheet_name=0 and sheet_name='COUNTRIES' refer to the same worksheet. The argument parse_dates=['IND_DAY'] tells Pandas to try to interpret the values in this column as dates or times.
There are other optional parameters you can use with .read_excel() and .to_excel() to determine the Excel engine, the encoding, the way to handle missing values and infinities, the method for writing column names and row labels, and so on.
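As one quick sketch, na_rep is an optional .to_excel() parameter that controls how missing values appear in the worksheet; the file name data-na.xlsx and the placeholder '(missing)' are arbitrary choices for illustration:

>>> df.to_excel('data-na.xlsx', sheet_name='COUNTRIES', na_rep='(missing)')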
SQL Files
Pandas IO tools can also read and write databases. In this next example, you'll write your data to a database called data.db. To get started, you'll need the SQLAlchemy package. To learn more about it, you can read the official ORM tutorial. You'll also need the database driver. Python has a built-in driver for SQLite.
You can install SQLAlchemy with pip:
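$ pip install sqlalchemy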
You can also install it with Conda:
$ conda install sqlalchemy
Once you have SQLAlchemy installed, import create_engine() and create a database engine:
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite:///data.db', echo=False)
Now that you have everything set up, the next step is to create a DataFrame object. It's convenient to specify the data types and apply .to_sql().
>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64',
...           'IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object
.astype() is a very convenient method you can use to set multiple data types at once.
Once you've created your DataFrame, you can save it to the database with .to_sql():
>>> df.to_sql('data.db', con=engine, index_label='ID')
The parameter con is used to specify the database connection or engine that you want to use. The optional parameter index_label specifies how to call the database column with the row labels. You'll often see it take on the value ID, Id, or id.
You should get the database data.db with a single table that looks like this:
The first column contains the row labels. To omit writing them into the database, pass index=False to .to_sql(). The other columns correspond to the columns of the DataFrame.
There are a few more optional parameters. For example, you can use schema to specify the database schema and dtype to determine the types of the database columns. You can also use if_exists, which says what to do if a database with the same name and path already exists (see the sketch after this list):
- if_exists='fail' raises a ValueError and is the default.
- if_exists='replace' drops the table and inserts new values.
- if_exists='append' inserts new values into the table.
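For example, here's a minimal sketch that reruns the write from above and replaces the existing table instead of failing:

>>> df.to_sql('data.db', con=engine, index_label='ID', if_exists='replace')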
You can load the data from the database with read_sql():
>>> df = pd.read_sql('data.db', con=engine, index_col='ID')
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
ID
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
The parameter index_col specifies the name of the column with the row labels. Note that this inserts an extra row after the header that starts with ID. You can fix this behavior with the following line of code:
>>> df.index.name = None
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75       None  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
Now you have the same DataFrame object as before.
Note that the continent for Russia is now None instead of nan. If you want to fill the missing values with nan, then you can use .fillna():
>>> df.fillna(value=float('nan'), inplace=True)
.fillna() replaces all missing values with whatever you pass to value. Here, you passed float('nan'), which says to fill all missing values with nan.
Also note that you didn't have to pass parse_dates=['IND_DAY'] to read_sql(). That's because your database was able to detect that the last column contains dates. However, you can pass parse_dates if you'd like. You'll get the same results.
There are other functions that you can use to read databases, like read_sql_table() and read_sql_query(). Feel free to try them out!
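Here's a minimal sketch with read_sql_query(), assuming the table is still named data.db as above; the double quotes around the table name inside the SQL statement are needed because the name contains a dot:

>>> df = pd.read_sql_query('SELECT * FROM "data.db"', con=engine,
...                        index_col='ID')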
Pickle Files
Pickling is the act of converting Python objects into byte streams. Unpickling is the inverse process. Python pickle files are binary files that keep the data and hierarchy of Python objects. They usually have the extension .pickle or .pkl.
You can save your DataFrame in a pickle file with .to_pickle():
>>> dtypes = {'POP': 'float64', 'AREA': 'float64', 'GDP': 'float64',
...           'IND_DAY': 'datetime64'}
>>> df = pd.DataFrame(data=data).T.astype(dtype=dtypes)
>>> df.to_pickle('data.pickle')
Like you did with databases, it can be convenient to first specify the data types. Then, you create a file data.pickle to contain your data. You could also pass an integer value to the optional parameter protocol, which specifies the protocol of the pickler.
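For example, this sketch pins the pickle protocol to version 4, which any Python 3.4+ interpreter can read:

>>> df.to_pickle('data.pickle', protocol=4)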
You can get the data from a pickle file with read_pickle():
>>> df = pd.read_pickle('data.pickle')
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
read_pickle() returns the DataFrame with the stored data. You can also check the data types:
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object
These are the same ones that you specified before using .to_pickle()
.
As a word of caution, you should always beware of loading pickles from untrusted sources. This can be unsafe! When you unpickle an untrustworthy file, it could execute arbitrary code on your machine, gain remote access to your computer, or otherwise exploit your device in other ways.
Working With Big Data
If your files are too big for saving or processing, then there are several approaches you can take to reduce the required disk space:

- Compress your files
- Choose only the columns you want
- Omit the rows you don't need
- Force the use of less precise data types
- Split the data into chunks
You'll take a look at each of these techniques in turn.
Compress and Decompress Files
You can create an archive file like you would a regular one, with the addition of a suffix that corresponds to the desired compression type:

- '.gz'
- '.bz2'
- '.zip'
- '.xz'
Pandas can deduce the compression type by itself:
>>> df = pd.DataFrame(data=data).T
>>> df.to_csv('data.csv.zip')
Here, you create a compressed .csv file as an archive. The size of the regular .csv file is 1048 bytes, while the compressed file only has 766 bytes.
You can open this compressed file as usual with the Pandas read_csv() function:
>>> df = pd.read_csv('data.csv.zip', index_col=0,
...                  parse_dates=['IND_DAY'])
>>> df
        COUNTRY      POP      AREA       GDP       CONT     IND_DAY
CHN       China  1398.72   9596.96  12234.78       Asia         NaT
IND       India  1351.16   3287.26   2575.67       Asia  1947-08-15
USA          US   329.74   9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07   1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32   8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71    881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96    923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09    147.57    245.63       Asia  1971-03-26
RUS      Russia   146.79  17098.25   1530.75        NaN  1992-06-12
MEX      Mexico   126.58   1964.38   1158.23  N.America  1810-09-16
JPN       Japan   126.22    377.97   4872.42       Asia         NaT
DEU     Germany    83.02    357.11   3693.20     Europe         NaT
FRA      France    67.02    640.68   2582.49     Europe  1789-07-14
GBR          UK    66.44    242.50   2631.23     Europe         NaT
ITA       Italy    60.36    301.34   1943.84     Europe         NaT
ARG   Argentina    44.94   2780.40    637.49  S.America  1816-07-09
DZA     Algeria    43.38   2381.74    167.56     Africa  1962-07-05
CAN      Canada    37.59   9984.67   1647.12  N.America  1867-07-01
AUS   Australia    25.47   7692.02   1408.68    Oceania         NaT
KAZ  Kazakhstan    18.53   2724.90    159.41       Asia  1991-12-16
read_csv()
decompresses the file before reading it into a DataFrame
.
You can specify the type of compression with the optional parameter compression, which can take on any of the following values:
- 'infer'
- 'gzip'
- 'bz2'
- 'zip'
- 'xz'
- None
The default value compression='infer'
indicates that Pandas should deduce the compression type from the file extension.
Here's how you would compress a pickle file:
>>> df = pd.DataFrame(data=data).T
>>> df.to_pickle('data.pickle.compress', compression='gzip')
You should get the file data.pickle.compress that you can later decompress and read:
>>> df = pd.read_pickle('data.pickle.compress', compression='gzip')
df again corresponds to the DataFrame with the same data as before.
You can give the other compression methods a try, as well. If you're using pickle files, then keep in mind that the .zip format supports reading only.
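For instance, here's a quick sketch that tries bz2 compression on a CSV file, with the compression type inferred from the extension on both the write and the read:

>>> df.to_csv('data.csv.bz2')
>>> df = pd.read_csv('data.csv.bz2', index_col=0)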
Choose Columns
The Pandas read_csv() and read_excel() functions have the optional parameter usecols that you can use to specify the columns you want to load from the file. You can pass the list of column names as the corresponding argument:
>>> df = pd.read_csv('data.csv', usecols=['COUNTRY', 'AREA'])
>>> df
       COUNTRY      AREA
0        China   9596.96
1        India   3287.26
2           US   9833.52
3    Indonesia   1910.93
4       Brazil   8515.77
5     Pakistan    881.91
6      Nigeria    923.77
7   Bangladesh    147.57
8       Russia  17098.25
9       Mexico   1964.38
10       Japan    377.97
11     Germany    357.11
12      France    640.68
13          UK    242.50
14       Italy    301.34
15   Argentina   2780.40
16     Algeria   2381.74
17      Canada   9984.67
18   Australia   7692.02
19  Kazakhstan   2724.90
Now you have a DataFrame that contains less data than before. Here, there are only the names of the countries and their areas.
Instead of the column names, you can also pass their indices:
>>> df = pd.read_csv('data.csv', index_col=0, usecols=[0, 1, 3])
>>> df
        COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90
Expand the code block below to compare these results with the file 'data.csv':
,COUNTRY,POP,AREA,GDP,CONT,IND_DAY
CHN,China,1398.72,9596.96,12234.78,Asia,
IND,India,1351.16,3287.26,2575.67,Asia,1947-08-15
USA,US,329.74,9833.52,19485.39,N.America,1776-07-04
IDN,Indonesia,268.07,1910.93,1015.54,Asia,1945-08-17
BRA,Brazil,210.32,8515.77,2055.51,S.America,1822-09-07
PAK,Pakistan,205.71,881.91,302.14,Asia,1947-08-14
NGA,Nigeria,200.96,923.77,375.77,Africa,1960-10-01
BGD,Bangladesh,167.09,147.57,245.63,Asia,1971-03-26
RUS,Russia,146.79,17098.25,1530.75,,1992-06-12
MEX,Mexico,126.58,1964.38,1158.23,N.America,1810-09-16
JPN,Japan,126.22,377.97,4872.42,Asia,
DEU,Germany,83.02,357.11,3693.2,Europe,
FRA,France,67.02,640.68,2582.49,Europe,1789-07-14
GBR,UK,66.44,242.5,2631.23,Europe,
ITA,Italy,60.36,301.34,1943.84,Europe,
ARG,Argentina,44.94,2780.4,637.49,S.America,1816-07-09
DZA,Algeria,43.38,2381.74,167.56,Africa,1962-07-05
CAN,Canada,37.59,9984.67,1647.12,N.America,1867-07-01
AUS,Australia,25.47,7692.02,1408.68,Oceania,
KAZ,Kazakhstan,18.53,2724.9,159.41,Asia,1991-12-16
You can see the following columns:

- The column at index 0 contains the row labels.
- The column at index 1 contains the country names.
- The column at index 3 contains the areas.
Similarly, read_sql() has the optional parameter columns that takes a list of column names to read:
>>> df = pd.read_sql('data.db', con=engine, index_col='ID',
...                  columns=['COUNTRY', 'AREA'])
>>> df.index.name = None
>>> df
        COUNTRY      AREA
CHN       China   9596.96
IND       India   3287.26
USA          US   9833.52
IDN   Indonesia   1910.93
BRA      Brazil   8515.77
PAK    Pakistan    881.91
NGA     Nigeria    923.77
BGD  Bangladesh    147.57
RUS      Russia  17098.25
MEX      Mexico   1964.38
JPN       Japan    377.97
DEU     Germany    357.11
FRA      France    640.68
GBR          UK    242.50
ITA       Italy    301.34
ARG   Argentina   2780.40
DZA     Algeria   2381.74
CAN      Canada   9984.67
AUS   Australia   7692.02
KAZ  Kazakhstan   2724.90
Again, the DataFrame only contains the columns with the names of the countries and areas. If columns is None or omitted, then all of the columns will be read, as you saw before. The default behavior is columns=None.
Omit Rows
When you test an algorithm for data processing or machine learning, you often don't need the entire dataset. It's convenient to load only a subset of the data to speed up the process. The Pandas read_csv() and read_excel() functions have some optional parameters that allow you to select which rows you want to load:
- skiprows: either the number of rows to skip at the beginning of the file if it's an integer, or the zero-based indices of the rows to skip if it's a list-like object
- skipfooter: the number of rows to skip at the end of the file
- nrows: the number of rows to read
Here's how you would skip rows with odd zero-based indices, keeping the even ones:
>>> df = pd.read_csv('data.csv', index_col=0, skiprows=range(1, 20, 2))
>>> df
        COUNTRY      POP     AREA      GDP       CONT     IND_DAY
IND       India  1351.16  3287.26  2575.67       Asia  1947-08-15
IDN   Indonesia   268.07  1910.93  1015.54       Asia  1945-08-17
PAK    Pakistan   205.71   881.91   302.14       Asia  1947-08-14
BGD  Bangladesh   167.09   147.57   245.63       Asia  1971-03-26
MEX      Mexico   126.58  1964.38  1158.23  N.America  1810-09-16
DEU     Germany    83.02   357.11  3693.20     Europe         NaN
GBR          UK    66.44   242.50  2631.23     Europe         NaN
ARG   Argentina    44.94  2780.40   637.49  S.America  1816-07-09
CAN      Canada    37.59  9984.67  1647.12  N.America  1867-07-01
KAZ  Kazakhstan    18.53  2724.90   159.41       Asia  1991-12-16
In this example, skiprows is range(1, 20, 2) and corresponds to the values 1, 3, …, 19. The instances of the Python built-in class range behave like sequences. The first row of the file data.csv is the header row. It has the index 0, so Pandas loads it in. The second row with index 1 corresponds to the label CHN, and Pandas skips it. The third row with the index 2 and label IND is loaded, and so on.
If you want to choose rows randomly, then skiprows can be a list or NumPy array with pseudo-random numbers, obtained either with pure Python or with NumPy.
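Here's a minimal sketch, assuming NumPy 1.17 or later for default_rng(); it skips ten randomly chosen data rows, while the header row at index 0 is never a candidate:

>>> import numpy as np
>>> rng = np.random.default_rng(seed=0)
>>> rows_to_skip = rng.choice(np.arange(1, 21), size=10, replace=False)
>>> df = pd.read_csv('data.csv', index_col=0, skiprows=rows_to_skip)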
Force Less Precise Data Types
If you're okay with less precise data types, then you can potentially save a significant amount of memory! First, get the data types with .dtypes again:
>>> df = pd.read_csv('data.csv', index_col=0, parse_dates=['IND_DAY'])
>>> df.dtypes
COUNTRY            object
POP               float64
AREA              float64
GDP               float64
CONT               object
IND_DAY    datetime64[ns]
dtype: object
The columns with the floating-point numbers are 64-bit floats. Each number of this type float64 consumes 64 bits or 8 bytes. Each column has 20 numbers and requires 160 bytes. You can verify this with .memory_usage():
>>> df.memory_usage()
Index      160
COUNTRY    160
POP        160
AREA       160
GDP        160
CONT       160
IND_DAY    160
dtype: int64
.memory_usage() returns an instance of Series with the memory usage of each column in bytes. You can conveniently combine it with .loc[] and .sum() to get the memory for a group of columns:
>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
480
This example shows how you can combine the numeric columns 'POP', 'AREA', and 'GDP' to get their total memory requirement. The argument index=False excludes data for row labels from the resulting Series object. For these three columns, you'll need 480 bytes.
You can also extract the data values in the form of a NumPy array with .to_numpy() or .values. Then, use the .nbytes attribute to get the total bytes consumed by the items of the array:
>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
480
The result is the same 480 bytes. So, how do you save memory?
In this case, you can specify that your numeric columns 'POP', 'AREA', and 'GDP' should have the type float32. Use the optional parameter dtype to do this:
>>> dtypes = {'POP': 'float32', 'AREA': 'float32', 'GDP': 'float32'}
>>> df = pd.read_csv('data.csv', index_col=0, dtype=dtypes,
...                  parse_dates=['IND_DAY'])
The dictionary dtypes specifies the desired data types for each column. It's passed to the Pandas read_csv() function as the argument that corresponds to the parameter dtype.
Now you can verify that each numeric column needs 80 bytes, or 4 bytes per item:
>>> df.dtypes
COUNTRY            object
POP               float32
AREA              float32
GDP               float32
CONT               object
IND_DAY    datetime64[ns]
dtype: object

>>> df.memory_usage()
Index      160
COUNTRY    160
POP         80
AREA        80
GDP         80
CONT       160
IND_DAY    160
dtype: int64

>>> df.loc[:, ['POP', 'AREA', 'GDP']].memory_usage(index=False).sum()
240

>>> df.loc[:, ['POP', 'AREA', 'GDP']].to_numpy().nbytes
240
Each value is a floating-point number of 32 bits or 4 bytes. The three numeric columns contain 20 items each. In total, you'll need 240 bytes of memory when you work with the type float32. This is half the size of the 480 bytes you'd need to work with float64.
In addition to saving memory, you can significantly reduce the time required to process data by using float32 instead of float64 in some cases.
Use Chunks to Iterate Through Files
Another way to deal with very large datasets is to split the data into smaller chunks and process one chunk at a time. If you use read_csv(), read_json() or read_sql(), then you can specify the optional parameter chunksize:
>>> data_chunk = pd.read_csv('data.csv', index_col=0, chunksize=8)
>>> type(data_chunk)
<class 'pandas.io.parsers.TextFileReader'>
>>> hasattr(data_chunk, '__iter__')
True
>>> hasattr(data_chunk, '__next__')
True
chunksize defaults to None and can take on an integer value that indicates the number of items in a single chunk. When chunksize is an integer, read_csv() returns an iterable that you can use in a for loop to get and process only a fragment of the dataset in each iteration:
>>> for df_chunk in pd.read_csv('data.csv', index_col=0, chunksize=8):
...     print(df_chunk, end='\n\n')
...     print('memory:', df_chunk.memory_usage().sum(), 'bytes',
...           end='\n\n\n')
...
        COUNTRY      POP     AREA       GDP       CONT     IND_DAY
CHN       China  1398.72  9596.96  12234.78       Asia         NaN
IND       India  1351.16  3287.26   2575.67       Asia  1947-08-15
USA          US   329.74  9833.52  19485.39  N.America  1776-07-04
IDN   Indonesia   268.07  1910.93   1015.54       Asia  1945-08-17
BRA      Brazil   210.32  8515.77   2055.51  S.America  1822-09-07
PAK    Pakistan   205.71   881.91    302.14       Asia  1947-08-14
NGA     Nigeria   200.96   923.77    375.77     Africa  1960-10-01
BGD  Bangladesh   167.09   147.57    245.63       Asia  1971-03-26

memory: 448 bytes


       COUNTRY     POP      AREA      GDP       CONT     IND_DAY
RUS     Russia  146.79  17098.25  1530.75        NaN  1992-06-12
MEX     Mexico  126.58   1964.38  1158.23  N.America  1810-09-16
JPN      Japan  126.22    377.97  4872.42       Asia         NaN
DEU    Germany   83.02    357.11  3693.20     Europe         NaN
FRA     France   67.02    640.68  2582.49     Europe  1789-07-14
GBR         UK   66.44    242.50  2631.23     Europe         NaN
ITA      Italy   60.36    301.34  1943.84     Europe         NaN
ARG  Argentina   44.94   2780.40   637.49  S.America  1816-07-09

memory: 448 bytes


        COUNTRY    POP     AREA      GDP       CONT     IND_DAY
DZA     Algeria  43.38  2381.74   167.56     Africa  1962-07-05
CAN      Canada  37.59  9984.67  1647.12  N.America  1867-07-01
AUS   Australia  25.47  7692.02  1408.68    Oceania         NaN
KAZ  Kazakhstan  18.53  2724.90   159.41       Asia  1991-12-16

memory: 224 bytes
In this example, the chunksize is 8. The first iteration of the for loop returns a DataFrame with the first eight rows of the dataset only. The second iteration returns another DataFrame with the next eight rows. The third and last iteration returns the remaining four rows.
In each iteration, you get and process the DataFrame with the number of rows equal to chunksize. It's possible to have fewer rows than the value of chunksize in the last iteration. You can use this functionality to control the amount of memory required to process data and keep that amount reasonably small.
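For example, here's a minimal sketch that builds a running total of the GDP column one chunk at a time, so the whole file never has to sit in memory at once:

>>> total_gdp = 0.0
>>> for df_chunk in pd.read_csv('data.csv', index_col=0, chunksize=8):
...     total_gdp += df_chunk['GDP'].sum()  # aggregate each chunk, then discard it
...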
Conclusion
You now know how to save the data and labels from Pandas DataFrame objects to different kinds of files. You also know how to load your data from files and create DataFrame objects.
You've used the Pandas read_csv() and .to_csv() methods to read and write CSV files. You also used similar methods to read and write Excel, JSON, HTML, SQL, and pickle files. These functions are very convenient and widely used. They allow you to save or load your data in a single function or method call.
You've also learned how to save time, memory, and disk space when working with large data files:

- Compress or decompress files
- Choose the rows and columns you want to load
- Use less precise data types
- Split data into chunks and process them one by one
You've mastered a significant step in the machine learning and data science process! If you have any questions or comments, then please put them in the comments section below.
Source: https://realpython.com/pandas-read-write-files/