Working with datasets is one of the most important skills in Data Analytics. A dataset is a collection of structured information, usually stored in CSV or Excel files or in a database.
In Python, we commonly use the pandas library to handle datasets efficiently.
What is a Dataset?
A dataset is a structured collection of data organized in rows and columns.
Rows represent records
Columns represent fields or features
Example:
Name | Age | Salary
Ali | 25 | 50000
Sara | 28 | 60000
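To make the rows-and-columns idea concrete, the example table can be built directly as a pandas DataFrame. This is a minimal sketch; the variable name df is only a convention.

```python
import pandas as pd

# Build the two-row example table by hand; each dict key becomes a column.
df = pd.DataFrame(
    {
        "Name": ["Ali", "Sara"],
        "Age": [25, 28],
        "Salary": [50000, 60000],
    }
)

print(df)                # the table, one record per row
print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the field names
```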
Common Dataset File Formats
CSV (Comma Separated Values)
Excel (.xlsx)
JSON
SQL Databases
CSV is one of the most widely used formats in analytics because it is plain text and almost every tool can read and write it.
Loading a Dataset in Python
First, install pandas if it is not already installed:
pip install pandas
Import pandas and load a CSV file:
import pandas as pd
df = pd.read_csv("data.csv")
For Excel files (reading .xlsx files also requires the openpyxl package):
df = pd.read_excel("data.xlsx")
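The loading step can be tried without any file on disk. The sketch below assumes a small, made-up CSV text standing in for data.csv and reads it through io.StringIO, since read_csv accepts a file-like object as well as a path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for data.csv.
csv_text = "Name,Age,Salary\nAli,25,50000\nSara,28,60000\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# usecols limits parsing to the columns you actually need.
names_and_ages = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Age"])

print(df)
print(names_and_ages)
```

With a real file, the io.StringIO(...) argument is simply replaced by the file path.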
Viewing Data
Display the first 5 rows:
df.head()
Display the last 5 rows:
df.tail()
Check the shape (number of rows, number of columns):
df.shape
Check the column names:
df.columns
Get summary information (column types and non-null counts):
df.info()
Get a statistical summary of the numeric columns:
df.describe()
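The viewing calls above can be tried together on a small table. The six rows below are made-up sample data that extend the earlier example, so that head() and tail() return different slices.

```python
import pandas as pd

# Hypothetical sample data extending the earlier example table.
df = pd.DataFrame(
    {
        "Name": ["Ali", "Sara", "Omar", "Lina", "Zaid", "Maya"],
        "Age": [25, 28, 31, 22, 40, 35],
        "Salary": [50000, 60000, 72000, 41000, 90000, 65000],
    }
)

first_rows = df.head()     # first 5 rows by default
last_two = df.tail(2)      # pass a number to change how many rows you get
n_rows, n_cols = df.shape  # shape is an attribute, so no parentheses
stats = df.describe()      # covers only the numeric columns by default

print(first_rows)
print(last_two)
print(n_rows, n_cols)
print(stats)
```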
Selecting Data
Select a single column:
df["Age"]
Select multiple columns:
df[["Name", "Salary"]]
Select rows using a condition:
df[df["Age"] > 25]
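The selection patterns above differ in what they return, and conditions can be combined. The sketch below uses made-up rows. The combined filter and the .loc call are additions that go slightly beyond the three patterns shown.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "Name": ["Ali", "Sara", "Omar"],
        "Age": [25, 28, 31],
        "Salary": [50000, 60000, 72000],
    }
)

ages = df["Age"]                 # one column name -> a Series
subset = df[["Name", "Salary"]]  # a list of names -> a DataFrame
older = df[df["Age"] > 25]       # a boolean mask keeps only matching rows

# Combine conditions with & (and) or | (or); wrap each one in parentheses.
mid_range = df[(df["Age"] > 25) & (df["Salary"] < 70000)]

# .loc filters rows and picks columns in a single step.
older_names = df.loc[df["Age"] > 25, "Name"]

print(older)
print(mid_range)
print(older_names)
```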
Handling Missing Data
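Count the missing values in each column:
df.isnull().sum()
Remove rows that contain missing values (this returns a new DataFrame; df itself is unchanged unless you assign the result):
df.dropna()
Fill missing values with a constant (this also returns a new DataFrame):
df.fillna(0)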
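A common stumbling block is that dropna() and fillna() return new DataFrames rather than changing df. The sketch below uses made-up data with two gaps. Filling Age with the column mean and Salary with 0 is only an illustrative choice, not a recommendation for real data.

```python
import pandas as pd

# Hypothetical sample data with one missing Age and one missing Salary.
df = pd.DataFrame(
    {
        "Name": ["Ali", "Sara", "Omar"],
        "Age": [25, None, 31],
        "Salary": [50000, 60000, None],
    }
)

missing_per_column = df.isnull().sum()  # count of missing values per column

dropped = df.dropna()  # keeps only rows with no missing values

# A dict lets each column get its own fill value.
filled = df.fillna({"Age": df["Age"].mean(), "Salary": 0})

print(missing_per_column)
print(dropped)
print(filled)
print(df.isnull().sum().sum())  # df still has its gaps: nothing was changed in place
```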
Sorting Data
Sort by a column:
df.sort_values("Salary")
Sort in descending order:
df.sort_values("Salary", ascending=False)
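Sorting also returns a new DataFrame, and the original index labels travel with the rows. The sketch below uses made-up data. The two-column sort and the reset_index call are additions to the two calls shown above.

```python
import pandas as pd

# Hypothetical sample data; two people share the same age.
df = pd.DataFrame(
    {
        "Name": ["Ali", "Sara", "Omar"],
        "Age": [25, 28, 25],
        "Salary": [50000, 60000, 72000],
    }
)

by_salary = df.sort_values("Salary", ascending=False)

# Sort by Age ascending, then break ties by Salary descending.
by_age_then_salary = df.sort_values(["Age", "Salary"], ascending=[True, False])

# Renumber the rows 0..n-1 after sorting, discarding the old labels.
renumbered = by_salary.reset_index(drop=True)

print(by_salary)
print(by_age_then_salary)
print(renumbered)
```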
Saving the Dataset
Save as CSV:
df.to_csv("new_data.csv", index=False)
Save as Excel (requires an Excel engine such as openpyxl to be installed):
df.to_excel("new_data.xlsx", index=False)
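To see what index=False changes, the sketch below writes the example table to a temporary folder twice and reads both files back. The temporary folder and the file names are assumptions made so the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {"Name": ["Ali", "Sara"], "Age": [25, 28], "Salary": [50000, 60000]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    clean_path = os.path.join(tmp_dir, "new_data.csv")
    indexed_path = os.path.join(tmp_dir, "with_index.csv")

    df.to_csv(clean_path, index=False)  # only the data columns are written
    df.to_csv(indexed_path)             # the row index is written as an extra column

    reloaded = pd.read_csv(clean_path)
    reloaded_with_index = pd.read_csv(indexed_path)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

Without index=False, the saved index comes back as an extra unnamed column when the file is loaded again.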
Why Working with Datasets is Important
Data analysis starts with understanding and cleaning data.
Proper dataset handling helps you:
Understand data structure
Identify errors
Clean missing values
Filter useful information
Prepare data for visualization and modeling
Key Takeaway
Working with datasets means loading, exploring, cleaning, filtering, and saving data. Mastering these steps is essential for building a strong foundation in Data Analytics.
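The five steps can be chained into one short script. This is a minimal sketch on made-up CSV text. The Age threshold of 23 is arbitrary, and the result is saved to a string rather than a file so the example stays self-contained.

```python
import io

import pandas as pd

# Hypothetical raw data with two incomplete records.
raw = "Name,Age,Salary\nAli,25,50000\nSara,28,\nOmar,,72000\nLina,22,41000\n"

df = pd.read_csv(io.StringIO(raw))   # load
print(df.isnull().sum())             # explore: where are the gaps?

clean = df.dropna()                  # clean: drop incomplete rows
result = clean[clean["Age"] > 23]    # filter: keep rows of interest
result = result.sort_values("Salary", ascending=False)

csv_out = result.to_csv(index=False)  # save: returns a string when no path is given
print(csv_out)
```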