Python is a core language for data manipulation: analyzing, transforming, and optimizing large datasets to get them into the desired format. It has a large collection of libraries, such as pandas, NumPy, and Matplotlib, for processing, cleaning, and analyzing large volumes of data.
Pandas is the most important of these: an open-source library for working with relational or labelled datasets. It plays a key role in data manipulation in Python because it makes processing data faster and simpler.
In this guide, we take a close look at the pandas library for data manipulation, covering its main functions and key features with code examples.
Data manipulation is the process of using programming languages and libraries to analyze, organize, and transform large datasets so that they become more readable, structured, and understandable. Effective data manipulation includes extracting, cleaning, filtering, and inserting data.
Sometimes, datasets contain duplicated or near-identical records, which are unnecessary for the desired output. Data manipulation helps to analyze, clean, and process the dataset so that the input to our model is more efficient.
Data is at the core of every industry, so data manipulation methods, techniques, and libraries are used widely, from technology and finance to real estate and marketing.
Pandas is an open-source library for handling, manipulating, and analyzing data. It provides easy-to-use data structures with many functions and methods to insert, filter, and append data.
These data structures are designed to make working with “relational” and “labelled” data easy. The key pandas data structure is the “DataFrame”, which lets you store, manage, and manipulate tabular data.
The following points show the kinds of data that pandas is well suited for:
Pandas does not ship with the standard Python installation, so you need to install it separately. Run the package installer (pip) command in your console:
# Install pandas using pip
pip install pandas
(or)
pip3 install pandas
The same pip command also works on Windows from Command Prompt or PowerShell, provided Python and pip are on your PATH.
After finishing the installation, import pandas at the top of your program using the import statement:
import pandas as pd
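A quick way to confirm that the installation and import worked is to print the installed version. This is a minimal sketch; the version string you see depends on your environment.

```python
import pandas as pd

# Print the installed pandas version to confirm the import works
print(pd.__version__)
```

If this raises ModuleNotFoundError, pandas was most likely installed for a different Python interpreter than the one running the script.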
After importing the library, let’s look at DataFrames and how to create them from various data sources.
The DataFrame is the most important data structure in pandas for storing tabular data. It is a two-dimensional labelled array whose columns can hold different data types, and it can be created from sources such as dictionaries, SQL tables, and CSV files.
Let’s now understand how to create DataFrames from the dictionary data.
Dictionaries are a natural fit for creating DataFrames: the keys become column names, and the values become the data in those columns.
For example, suppose we have a Python dictionary named “studentData” with the key “Name”, whose value is a list of student names, and the key “Percentage”, whose value is a list of the percentage each student scored.
studentData = {'Name': ['Joey', 'Rock', 'Zyne'], 'Percentage': ['40%', '85%', '88%']}
The following code demonstrates how to convert the “studentData” dictionary into a DataFrame using the pd.DataFrame() constructor:
import pandas as pd
# Creating a DataFrame from a dictionary
studentData = {'Name': ['Joey', 'Rock', 'Zyne'],
'Percentage': ['40%', '85%', '88%']}
dataFrame_from_dict = pd.DataFrame(studentData)
# Printing the DataFrame
print(dataFrame_from_dict)
The following image demonstrates the DataFrame created from the dictionary:
As you can see, the output contains the Name and Percentage columns created from the studentData dictionary, and pandas adds a leading index column that labels each row.
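The structure described above can also be checked programmatically. The sketch below rebuilds the same DataFrame (the short variable name df is only for illustration) and inspects its shape, column labels, and automatically generated index.

```python
import pandas as pd

studentData = {'Name': ['Joey', 'Rock', 'Zyne'],
               'Percentage': ['40%', '85%', '88%']}
df = pd.DataFrame(studentData)

# (rows, columns) of the DataFrame
print(df.shape)           # (3, 2)
# Column labels taken from the dictionary keys
print(list(df.columns))   # ['Name', 'Percentage']
# Row labels generated automatically by pandas
print(list(df.index))     # [0, 1, 2]
```

Note that the index is not counted as a column: the DataFrame has two data columns, and the index only labels the rows.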
CSV (Comma-Separated Values) is a common data format that you will often work with in Python. Pandas provides a function that loads a CSV file and automatically creates a DataFrame from it.
For example, we have a users.csv file that contains some user data:
The pandas read_csv() function reads the CSV file and returns its contents as a DataFrame.
The following code loads the users.csv file and creates a DataFrame from its data:
import pandas as pd
# Reading a CSV file and Creating a DataFrame
dataFrame_from_csv = pd.read_csv('users.csv')
# Printing the DataFrame
print(dataFrame_from_csv)
Note: You must pass the CSV file path to read_csv() as a string.
The following image demonstrates the DataFrame created from the users.csv file:
As you can see, the pandas read_csv() function loads the data from the CSV file and creates a neatly structured DataFrame from it.
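If you do not have a users.csv file at hand, you can still try read_csv() with in-memory text. In the sketch below, the CSV content and its column names (id, name, city) are made up for illustration and are not the contents of the real users.csv file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for users.csv
csv_text = """id,name,city
1,Asha,Noida
2,Ben,Washington
3,Chen,Hong Kong
"""

# read_csv() accepts a file path or any file-like object
users = pd.read_csv(io.StringIO(csv_text))

print(users)
print(list(users.columns))  # ['id', 'name', 'city']
print(len(users))           # 3
```

The first line of the CSV text is treated as the header row, so its values become the column names.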
Suppose you have a DataFrame with 1000 or more rows, but you only want to see the first five rows to understand the data format.
For this, pandas provides the head() method, which returns the first five rows of the DataFrame by default. We can call head() on our DataFrame variable:
import pandas as pd
# Reading a CSV file and Creating a DataFrame
dataFrame_from_csv = pd.read_csv('users.csv')
# Printing the first five rows of the DataFrame
print(dataFrame_from_csv.head())
The following image shows the output of the head() method, with the first five rows of the DataFrame:
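head() also accepts the number of rows to return, and tail() does the same from the end of the DataFrame. The sketch below uses a made-up 1000-row DataFrame with a hypothetical user_id column instead of users.csv.

```python
import pandas as pd

# Hypothetical DataFrame with 1000 rows
df = pd.DataFrame({'user_id': range(1, 1001)})

first_five = df.head()      # default n=5
first_three = df.head(3)    # first three rows
last_two = df.tail(2)       # last two rows

print(len(first_five))                  # 5
print(first_three['user_id'].tolist())  # [1, 2, 3]
print(last_two['user_id'].tolist())     # [999, 1000]
```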
The info() method prints information about a DataFrame, including the index range (RangeIndex), the columns with their non-null counts and data types, and the memory usage.
The info() method is useful when you want a concise summary of a large dataset without printing the whole DataFrame to the screen.
Let’s load our users.csv file again, create the DataFrame, and then call the info() method on the variable that holds it:
import pandas as pd
# Reading a CSV file and Creating a DataFrame
dataFrame_from_csv = pd.read_csv('users.csv')
# Printing a summary of the DataFrame (info() prints directly and returns None)
dataFrame_from_csv.info()
The following image shows the output of the info() method with the information about the DataFrame:
The most helpful information you get from info() is the total number of columns and the memory usage.
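Because info() prints its summary instead of returning it, you may sometimes want to capture the text, for example to write it to a log. A minimal sketch, assuming the small student DataFrame from earlier rather than users.csv, uses the buf parameter of info():

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Joey', 'Rock', 'Zyne'],
                   'Percentage': ['40%', '85%', '88%']})

# info() writes to standard output by default; buf redirects it
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

print(summary)
# Per-column data types, also reported by info()
print(df.dtypes)
```

The exact data type names and memory figures in the summary depend on your pandas version and platform.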
Older pandas releases offered a DataFrame.append() method for adding a row to the bottom of a DataFrame, but it was deprecated in pandas 1.4 and removed in pandas 2.0. In current versions, use pd.concat() instead: define the row as a dictionary of key-value pairs, wrap it in a one-row DataFrame, and concatenate it with the original DataFrame.
When calling pd.concat(), also pass ignore_index=True to renumber the row indices.
The following code demonstrates adding a new row to a DataFrame using pd.concat():
import pandas as pd
# Creating a DataFrame from a dictionary
studentData = {'Name': ['Joey', 'Rock', 'Zyne'],
'Percentage': ['40%', '85%', '88%']}
dataFrame_from_dict = pd.DataFrame(studentData)
# Printing DataFrame
print('Original DataFrame\n------------------')
print(dataFrame_from_dict)
# Wrapping the new row in a one-row DataFrame and concatenating it
new_student = {'Name': 'Bob', 'Percentage': '65%'}
new_df = pd.concat([dataFrame_from_dict, pd.DataFrame([new_student])], ignore_index=True)
# Printing New DataFrame with the added row
print('\nNew row added to DataFrame\n--------------------------')
print(new_df)
The below image demonstrates the DataFrame before and after adding the new row:
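Another way to add a single row is label-based assignment with loc. Unlike the approach above, this modifies the DataFrame in place. The sketch assumes the DataFrame still has its default 0-based integer index, so that len(df) is the next unused label.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Joey', 'Rock', 'Zyne'],
                   'Percentage': ['40%', '85%', '88%']})

# With a default 0..n-1 index, len(df) is the next unused label
df.loc[len(df)] = ['Bob', '65%']

print(df)
print(len(df))  # 4
```

If the index is not a default range (for example, after rows were dropped), len(df) may collide with an existing label and overwrite that row, so this shortcut is best kept for simple cases.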
You can use the DataFrame.assign() method to add a column to the DataFrame. It takes keyword arguments (**kwargs), where each keyword is the new column name and its value is the column data. This method doesn’t mutate the original DataFrame; instead, it returns a new DataFrame with the column added.
The following code demonstrates adding a new column to a DataFrame using assign():
import pandas as pd
# Creating a DataFrame from a dictionary
studentData = {'Name': ['Joey', 'Rock', 'Zyne'],
'Percentage': ['40%', '85%', '88%']}
dataFrame_from_dict = pd.DataFrame(studentData)
# Printing DataFrame
print('Original DataFrame\n------------------')
print(dataFrame_from_dict)
# Adding column of data to the DataFrame
age = [24, 22, 18]
new_df = dataFrame_from_dict.assign(Age=age)
# Printing New DataFrame with the added column
print('\nNew column added to DataFrame\n--------------------------')
print(new_df)
The below image demonstrates the DataFrame before and after adding the new column:
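assign() also accepts a callable, which is handy when the new column is derived from existing ones. The sketch below adds a numeric column computed from the Percentage strings; the column name Score and the variable names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Joey', 'Rock', 'Zyne'],
                   'Percentage': ['40%', '85%', '88%']})

# Derive a numeric column by stripping the '%' sign and converting to int
with_score = df.assign(
    Score=lambda d: d['Percentage'].str.rstrip('%').astype(int)
)

print(with_score)
print(list(df.columns))              # ['Name', 'Percentage'] (original unchanged)
print(with_score['Score'].tolist())  # [40, 85, 88]
```

Storing the percentages as numbers makes later operations such as sorting, filtering, or averaging straightforward.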
You can use the drop() method to remove rows and columns from a DataFrame: pass the index label to remove a row, or the column label to remove a column. When removing a column, make sure to pass axis=1.
The following code demonstrates deleting a column and a row from the DataFrame using the drop() method:
import pandas as pd
# Creating a DataFrame from a dictionary
studentData = {'Name': ['Joey', 'Rock', 'Zyne'],
'Percentage': ['40%', '85%', '88%']}
dataFrame_from_dict = pd.DataFrame(studentData)
# Printing DataFrame
print('Original DataFrame\n------------------')
print(dataFrame_from_dict)
# Deleting Column & Row from the DataFrame
new_df = dataFrame_from_dict.drop('Percentage', axis=1)
new_df = new_df.drop(2)
# Printing DataFrame after deleting Column & Row
print('\nDataFrame after deleting Column & Row\n--------------------------')
print(new_df)
The below image demonstrates the DataFrame before and after deleting the column and row:
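As an alternative to axis=1, drop() accepts the columns= and index= keywords, which state explicitly what is being removed and let you drop a column and a row in one call. A minimal sketch with the same student data (the variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Joey', 'Rock', 'Zyne'],
                   'Percentage': ['40%', '85%', '88%']})

# columns= and index= name the targets explicitly, so axis is not needed
trimmed = df.drop(columns=['Percentage'], index=[2])

print(trimmed)
print(df.shape)       # (3, 2) original unchanged
print(trimmed.shape)  # (2, 1)
```

Like assign(), drop() returns a new DataFrame by default and leaves the original untouched.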
Python is flexible and versatile, so it is used in many fields, such as Artificial Intelligence (AI), web development, IoT, and more. A good understanding of data manipulation in Python is therefore important for analyzing, organizing, and optimizing large datasets. Data manipulation in Python is useful for web development, data analysis, and data processing.
As a development company, we at Delphin Technologies provide a host of services like designing and developing websites and mobile apps.