Sunday, September 8, 2024

[Study Notes] Data Analysis (Ⅱ)——Pandas

Pandas

1. What is Pandas?
  • 1.1. Pandas overview
  • 1.2. Pandas features
  • 1.3. Some differences between Pandas and NumPy
2. Pandas data structure——Series
  • 2.1. Series overview
  • 2.2. Series creation
  • 2.3. Common attributes of Series
  • 2.4. Common methods of Series
3. Pandas data structure——DataFrame
  • 3.1. DataFrame overview
  • 3.2. DataFrame creation
  • 3.3. Column index operations on a DataFrame
  • 3.4. Row index operations on a DataFrame
  • 3.5. Conditional queries and indexer queries
  • 3.6. Common methods of DataFrame
  • 3.7. Common properties of DataFrame
  • 3.8. Statistical functions
4. Pandas plotting——plot
  • 4.1. plot introduction
  • 4.2. Main parameters of plot
  • 4.3. Line chart
    • 4.3.1. Line chart
    • 4.3.2. Multiple subplots
    • 4.3.3. Arranging multiple subplots arbitrarily
    • 4.3.4. Primary and secondary axis settings
  • 4.4. Bar chart
    • 4.4.1. Displaying Series data
    • 4.4.2. Displaying DataFrame data
    • 4.4.3. Stacked bar chart
    • 4.4.4. Horizontal bar chart
    • 4.4.5. Histogram
    • 4.4.6. Box plot
  • 4.5. Area chart
    • 4.5.1. Stacked
    • 4.5.2. Non-stacked
  • 4.6. Pie chart
  • 4.7. Kernel density plot
5. Reading and writing files
  • 5.1. What is CSV?
  • 5.2. Advantages of CSV files
  • 5.3. Disadvantages of CSV files
  • 5.4. Reading CSV files——read_csv
  • 5.5. Writing CSV files——to_csv
  • 5.6. Writing Excel files——to_excel
  • 5.7. Reading Excel files——read_excel
6. Table joins, grouped aggregation, and pivot tables
  • 6.1. Table joins
  • 6.2. Grouped aggregation
  • 6.3. Pivot tables
7. Statement

1. What is Pandas?

Install dependencies: pip install pandas
Import: import pandas as pd

1.1. Pandas overview

Pandas is a Python-based data analysis library that provides rich data structures and data manipulation tools for processing structured data. It is particularly well suited to tasks such as data cleaning, data transformation, data analysis, and data visualization.

1.2. Pandas features

① It provides a simple, efficient DataFrame object with default labels (labels can also be customized)
② Ability to quickly load data from files in different formats (such as Excel, CSV, SQL files) and then convert them into processable objects
③ Ability to group data by row and column labels, and perform aggregation and conversion operations on grouped objects
④ Can easily implement data normalization operations and missing value processing
⑤ Able to easily add, modify or delete data columns of DataFrame
⑥ Able to process data sets in different formats, such as matrix data, heterogeneous data tables, time series, etc.
⑦ Provides a variety of ways to process data sets, such as building subsets, slicing, filtering, grouping, and reordering, etc.
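As a taste of features ④ and ⑤, a minimal sketch with made-up data — fill a missing value and derive a new column:

```python
import pandas as pd

# Hypothetical data with one missing value
df = pd.DataFrame({'score': [90.0, None, 75.0]})

# Feature ⑤: add a derived column (NaN compares as False)
df['passed'] = df['score'] >= 60

# Feature ④: fill the missing value with the column mean
df['score'] = df['score'].fillna(df['score'].mean())
print(df)
```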

1.3. Some differences between Pandas and Numpy

① Data structure
NumPy mainly provides multi-dimensional array objects ndarray, which is a fixed-size array and all elements are of the same type.
Pandas provides two main data structures: Series and DataFrame. Series is a one-dimensional labeled array, similar to NumPy's one-dimensional array; while DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or database table
② Usage:
NumPy is mainly used in the fields of numerical computing and scientific computing, providing high-performance numerical computing functions, such as array operations, linear algebra operations, Fourier transform, etc.
Pandas is mainly used in the fields of data analysis and data processing. It provides a wealth of data operation tools, including data cleaning, data conversion, data grouping, data aggregation, data visualization and other functions.
③ Data processing capabilities:
NumPy is more suitable for processing numerical data and provides a wealth of numerical operation functions and methods.
Pandas is more suitable for processing structured data, such as tabular data, time series data, etc., and provides flexible data operation and analysis tools.

| Comparison | Pandas | NumPy |
| — | — | — |
| Adaptability | Mainly used to process tabular data | Mainly used to process numerical data |
| Tools | Provides the Series and DataFrame data structures | Builds ndarray arrays to hold data |
| Performance | More advantageous for data with more than 500,000 rows | Better performance for data with fewer than 500,000 rows |
| Memory utilization | Consumes more memory than NumPy | Consumes less memory |
| Core object | DataFrame, a 2-D data table object | ndarray, a multidimensional array object |
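One practical consequence of the label-based design is alignment: NumPy combines arrays purely by position, while Pandas aligns Series on their index labels before computing. A small sketch (data made up for illustration):

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + b)  # NumPy adds element by element, purely by position

s1 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s2 = pd.Series([10, 20, 30], index=['y', 'z', 'w'])
res = s1 + s2  # Pandas aligns on labels; unmatched labels become NaN
print(res)
```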

To convert a Pandas DataFrame into a NumPy ndarray, use to_numpy():

import pandas as pd

# Construct a DataFrame object
df = pd.DataFrame({"P": [2, 3], "Q": [4.0, 5.8]})
df['R'] = pd.date_range('2020-12-23', periods=2)  # add another column
print(df)
print('-' * 30)

# Convert to a NumPy array
n = df.to_numpy()
print(n) 

2. Pandas data structure——Series

2.1. Series Overview

A Series in Pandas is a one-dimensional labeled array, similar to a dictionary in Python or a one-dimensional array in NumPy. It consists of a set of data and a set of indexes (tags) associated with it. It can store any type of data, and the index can be of any type.

2.2. Series creation

① Create using list

import pandas as pd

list1 = [1, 2, 3, 4]
s1 = pd.Series(list1)
print(s1) 
0    1
1    2
2    3
3    4
dtype: int64 

② Create using ndarray object

import pandas as pd
import numpy as np

arr = np.array([1, 2, 3, 4])
s1 = pd.Series(arr, index=['A', 'B', 'C', 'D'])  # the index parameter specifies custom labels
print(s1) 
A    1
B    2
C    3
D    4
dtype: int32 

③ Create using dictionary

import pandas as pd

s1 = pd.Series({'name': 'hyh', 'age': 18, 'gender': 'boy'})
print(s1) 
name      hyh
age        18
gender    boy
dtype: object 

④ Create using a scalar

import pandas as pd

s1 = pd.Series(4, index=['a', 'b', 'c', 'd'])
print(s1) 
a    4
b    4
c    4
d    4
dtype: int64 

2.3. Common attributes of Series

| Property | Description |
| — | — |
| values | Returns the data as an ndarray |
| index | The index, an Index object |
| axes | A list of the Index objects for each axis |
| size | Number of elements |
| ndim | Number of dimensions of the data |
| dtype | Data type |

import pandas as pd

l1 = [1, 2, 3, 4]
s = pd.Series(l1, index=['a', 'b', 'c', 'd'])
print("Series:\n", s)
print('-' * 30)
# access data
print(s.values, type(s.values))
print('-' * 30)
# access index
print(s.index, list(s.index))
print('-' * 30)
# Access the index of each axis
print(s.axes, s.axes[0] is s.index)
print('-' * 30)
# access size
print(s.size)
print('-' * 30)
# access dimensions
print(s.ndim)
print('-' * 30)
# access data type
print(s.dtype) 
Series:
 a    1
b    2
c    3
d    4
dtype: int64
------------------------------
[1 2 3 4] <class 'numpy.ndarray'>
------------------------------
Index(['a', 'b', 'c', 'd'], dtype='object') ['a', 'b', 'c', 'd']
------------------------------
[Index(['a', 'b', 'c', 'd'], dtype='object')] True
------------------------------
4
------------------------------
1
------------------------------
int64 

2.4. Common methods of Series

Note: ser below refers to the instance object of Series

| Method | Description |
| — | — |
| ser.head(num) | Returns the first num rows of data |
| ser.tail(num) | Returns the last num rows of data |
| pd.isnull(ser) / pd.notnull(ser) | Detect missing values, returning a boolean Series (isnull marks missing values True; notnull is the inverse) |

import pandas as pd
import numpy as np

arr = np.random.randint(1, 10, 5)
s = pd.Series(arr)
print(s)
print('-' * 30)
print(s.head(2))
print('-' * 30)
print(s.tail(2))
print('-' * 30)
print(pd.isnull(s))
print('-' * 30) 
0    8
1    1
2    4
3    2
4    4
dtype: int32
------------------------------
0    8
1    1
dtype: int32
------------------------------
3    2
4    4
dtype: int32
------------------------------
0    False
1    False
2    False
3    False
4    False
dtype: bool 

3. Pandas data structure——DataFrame

3.1. DataFrame overview

DataFrame is one of the most important data structures in Pandas. It is a two-dimensional tagged data structure, similar to a spreadsheet or a table in a database. DataFrame consists of rows and columns. Each column can be of different data types (such as integers, floating point numbers, strings, etc.), and each row and column has a corresponding label index.

3.2. DataFrame creation

① Create using list

import pandas as pd

list1 = [1, 2]
df = pd.DataFrame(list1)
print(df)
print('-' * 30)

list2 = [('Alex', 11), ('Bob', 12), ('Clarke', 13)]
df = pd.DataFrame(list2, columns=['name', 'age'])
df['age'] = df['age'].astype(float)  # cast only the numeric column; dtype=float on a mixed table is unreliable
print(df) 
 0
0  1
1  2
------------------------------
     name   age
0    Alex  11.0
1     Bob  12.0
2  Clarke  13.0 

② Create using a dictionary of lists

import pandas as pd

data1 = {'name': ['Tom', 'Jack', 'Steve', 'Hyh'], 'age': [28, 48, 61, 15]}
df = pd.DataFrame(data1)
print(df)

data2 = {'name': ['Pyy', 'Zayn', 'Yh'], 'age': [32, 51, 23]}
df = pd.DataFrame(data2, index=['rank1', 'rank2', 'rank3'])
print(df) 
 name  age
0    Tom   28
1   Jack   48
2  Steve   61
3    Hyh   15
       name  age
rank1   Pyy   32
rank2  Zayn   51
rank3    Yh   23 

③ Create using a list of dictionaries

import pandas as pd

data1 = [{'name': 'Pyy', 'age': 22}, {'name': 'Zayn', 'age': 31, 'gender': 'boy'}]
df = pd.DataFrame(data1)
print(df)
print('-' * 30)

data1 = [{'name': 'Pyy', 'age': 22}, {'name': 'Zayn', 'age': 31, 'gender': 'boy'}]
df = pd.DataFrame(data1, index=['first', 'second'], columns=['age', 'name'])
print(df) 
 name  age gender
0   Pyy   22    NaN
1  Zayn   31    boy
------------------------------
        age  name
first    22   Pyy
second   31  Zayn 

④ Create using a dictionary of Series

import pandas as pd

ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([2, 3, 4, 5], index=['b', 'c', 'd', 'e'])
arr = {'one': ser1, 'two': ser2}
df = pd.DataFrame(arr)
print(df) 
 one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  NaN  5.0 

3.3. Column index operations on a DataFrame

import pandas as pd

ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([2, 3, 4, 5], index=['b', 'c', 'd', 'e'])
arr = {'one': ser1, 'two': ser2}
df = pd.DataFrame(arr)
print(df)
print('-' * 30)

# Access a single column; returns a Series object
print(df['one'])
print('-' * 30)

# Access multiple columns; returns a DataFrame object
print(df[['one', 'two']])
print('-' * 30)

# add a column
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df.insert(loc=1, column='four', value=[40, 50, 60, 70, 80])  # the inserted data must match the length of the existing index
print(df)
print('-' * 30)

# Delete a column
df.pop('four')
print(df) 
 one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  NaN  5.0
------------------------------
a    1.0
b    2.0
c    3.0
d    NaN
e    NaN
Name: one, dtype: float64
------------------------------
   one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  NaN  5.0
------------------------------
   one  four  two  three
a  1.0    40  NaN   10.0
b  2.0    50  2.0   20.0
c  3.0    60  3.0   30.0
d  NaN    70  4.0    NaN
e  NaN    80  5.0    NaN
------------------------------
   one  two  three
a  1.0  NaN   10.0
b  2.0  2.0   20.0
c  3.0  3.0   30.0
d  NaN  4.0    NaN
e  NaN  5.0    NaN 

3.4. Row index operations on a DataFrame

import pandas as pd

ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([2, 3, 4, 5], index=['b', 'c', 'd', 'e'])
arr = {'one': ser1, 'two': ser2}
df = pd.DataFrame(arr)
print(df)
print('-' * 30)

# Access rows using slices
print(df[1:4])
print('-' * 30)

# Use loc to access a row by its label
print(df.loc['b'], '\n', dict(df.loc['b']))
print('-' * 30)

# Use iloc to access a row by its integer position
print(df.iloc[3], '\n', dict(df.iloc[3]))
print('-' * 30)

# Append rows (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
newrow = pd.DataFrame([[10, 20], [30, 40]], columns=['one', 'two'], index=['f', 'g'])
df = pd.concat([df, newrow])
print(df)
print('-' * 30)

# delete
df = df.drop('b')
print(df) 
 one  two
a  1.0  NaN
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
e  NaN  5.0
------------------------------
   one  two
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0
------------------------------
one    2.0
two    2.0
Name: b, dtype: float64 
 {'one': 2.0, 'two': 2.0}
------------------------------
one    NaN
two    4.0
Name: d, dtype: float64 
 {'one': nan, 'two': 4.0}
------------------------------
    one   two
a   1.0   NaN
b   2.0   2.0
c   3.0   3.0
d   NaN   4.0
e   NaN   5.0
f  10.0  20.0
g  30.0  40.0
------------------------------
    one   two
a   1.0   NaN
c   3.0   3.0
d   NaN   4.0
e   NaN   5.0
f  10.0  20.0
g  30.0  40.0 

3.5. Conditional queries and indexer queries

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
                  columns=['c1', 'c2', 'c3', 'c4'],
                  index=['i1', 'i2', 'i3', 'i4'])
print(df)
print('-' * 30)

# condition index
print(df[df['c1'] >= 5])
print('-' * 30)

# multi-condition
print(df[(df['c1'] >= 5) & (df['c3'] <= 5)])
print('-' * 30)

# Indexers are just loc and iloc: pass the row selection first, then the column selection
print(df.loc[['i1', 'i3'], ['c2', 'c4']])
print('-' * 30)
print(df.loc[df['c1'] >= 5, ['c2', 'c4']])
print('-' * 30)
print(df.iloc[0:2, 1:3]) 
# Output
    c1  c2  c3  c4
i1   1   5   8   9
i2   1   2   3   3
i3   4   9   5   7
i4   7   9   1   4
------------------------------
    c1  c2  c3  c4
i4   7   9   1   4
------------------------------
    c1  c2  c3  c4
i4   7   9   1   4
------------------------------
    c2  c4
i1   5   9
i3   9   7
------------------------------
    c2  c4
i4   9   4
------------------------------
    c2  c3
i1   5   8
i2   2   3 

3.6. Common methods of DataFrame

| Method | Description |
| — | — |
| rename(columns, index, inplace) | Rename rows and columns. columns: mapping of new column names, index: mapping of new row names, inplace: whether to modify the original data (default False) |
| insert(loc, column, value, allow_duplicates) | Insert a column. loc: insertion position, column: column name, value: column values, allow_duplicates: whether to allow duplicate column names (default False) |
| drop(labels, axis, index, inplace) | Delete rows or columns. labels: row/column labels to drop, index: row labels (equivalent to labels with axis=0; use one or the other), axis: 0 deletes rows (default), 1 deletes columns, inplace: whether to delete on the original data (default False) |
| head(n), tail(n) | The first / last n rows |

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
                  columns=['c1', 'c2', 'c3', 'c4'],
                  index=['i1', 'i2', 'i3', 'i4'])
# original array
print(df)
print('-' * 30)

# Modify row and column names
df.rename(index={'i1': 0, 'i2': 1}, inplace=True)  # Modify row name
df.rename(columns={'c1': '3', 'c2': '4'}, inplace=True)  # Modify column name
print(df)
print('-' * 30)

# insert column
df.insert(loc=2, column='5', value=[100, 200, 300, 400])
print(df)
print('-' * 30)

# Delete rows and columns
df.drop(axis=0, labels='i3', inplace=True)  # Delete row
df.drop(axis=1, labels='5', inplace=True)  # Delete column
print(df)
print('-' * 30)

# Get the first and last rows
print(df.head(1))
print('-' * 30)
print(df.tail(2)) 
# Output
    c1  c2  c3  c4
i1   7   9   5   3
i2   2   8   2   6
i3   7   9   7   7
i4   5   7   5   9
------------------------------
    3  4  c3  c4
0   7  9   5   3
1   2  8   2   6
i3  7  9   7   7
i4  5  7   5   9
------------------------------
    3  4    5  c3  c4
0   7  9  100   5   3
1   2  8  200   2   6
i3  7  9  300   7   7
i4  5  7  400   5   9
------------------------------
    3  4  c3  c4
0   7  9   5   3
1   2  8   2   6
i4  5  7   5   9
------------------------------
   3  4  c3  c4
0  7  9   5   3
------------------------------
    3  4  c3  c4
1   2  8   2   6
i4  5  7   5   9 

3.7. Common properties of DataFrame

| Property/Method | Description |
| — | — |
| head(num), tail(num) | The first num rows and the last num rows of data |
| values | Returns the data as a two-dimensional ndarray |
| size | Number of elements (not the number of rows; a row may contain multiple elements) |
| axes | Returns a list containing the row and column Index objects |
| ndim | Number of dimensions |
| shape | Returns a tuple representing the dimensions of the DataFrame |
| dtypes | Returns the data type of each column |
| T | Transpose |
| shift(periods, freq, axis, fill_value) | Shift rows/columns by the given stride. periods: int, the shift amount (default 1; positive shifts down/right). freq: a date offset, suited to time series (default None). axis: 0 shifts up/down (default), 1 shifts left/right. fill_value: value used to fill the holes left by the shift |
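The simple attributes in this table can be verified in a couple of lines (a minimal sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})
print(df.shape)   # (rows, columns)
print(df.size)    # number of elements: rows x columns
print(df.ndim)    # number of dimensions
print(df.dtypes)  # one dtype per column
print(df.axes)    # [row Index, column Index]
print(df.T)       # transpose: row and column labels swap roles
```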

import pandas as pd

df = pd.DataFrame({'a_data': [28, 71, 37, 16, 58], 'b_data': [35, 25, 17, 39, 54], 'c_data': [36, 51, 34, 56, 71]})
print(df)
print('-' * 30)
df = df.shift(periods=1).shift(periods=1, axis=1)
print(df)
print('-' * 30)
df = df.shift(periods=1, axis=1, fill_value='a')
print(df) 
 a_data  b_data  c_data
0      28      35      36
1      71      25      51
2      37      17      34
3      16      39      56
4      58      54      71
------------------------------
   a_data  b_data  c_data
0     NaN     NaN     NaN
1     NaN    28.0    35.0
2     NaN    71.0    25.0
3     NaN    37.0    17.0
4     NaN    16.0    39.0
------------------------------
  a_data b_data c_data
0      a    NaN    NaN
1      a    NaN     28
2      a    NaN     71
3      a    NaN     37
4      a    NaN     16 

3.8. Statistical functions

| Method | Description |
| — | — |
| count() | Number of non-null values |
| sum() | Sum |
| mean() | Mean |
| median() | Median |
| mode() | Mode |
| std() | Standard deviation |
| min() / max() | Minimum / maximum value |
| abs() | Absolute value |
| prod() | Product of all values |
| cumsum() | Cumulative sum (axis=0: down each column, axis=1: across each row) |
| cumprod() | Cumulative product (axis=0: down each column, axis=1: across each row) |
| corr() | Correlation coefficient between columns, ranging from -1 to 1; the closer to ±1, the stronger the correlation |
| describe(include='all') | Summary statistics |

import pandas as pd

# Build a dict of Series
d = {'Name': pd.Series(['Xiao Ming', 'Xiao Liang', 'Xiaohong', 'Xiaohua', 'Lao Zhao', 'Xiao Cao', 'Xiao Chen',
                        'Lao Li', 'Lao Wang', 'Xiao Feng', 'Xiao He', 'Lao Zhang']),
     'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
     'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])
     }
df = pd.DataFrame(d)
print(df)
print('-' * 30)
print(df.sum())
print('-' * 30)
print(df.sum(axis=1))
print('-' * 30)
print(df.describe(include='all')) 
 Name  Age  Rating
0    Xiao Ming   25    4.23
1   Xiao Liang   26    3.24
2    Xiaohong   25    3.98
3    Xiaohua   23    2.56
4    Lao Zhao   30    3.20
5    Xiao Cao   29    4.60
6    Xiao Chen   23    3.80
7    Lao Li   34    3.78
8    Lao Wang   40    2.98
9    Xiao Feng   30    4.80
10     Xiao He   51    4.10
11   Lao Zhang   46    3.65
------------------------------
Name      Xiao Ming Xiao Liang Xiao Hong Xiao Hua Lao Zhao Xiao Cao Xiao Chen Lao Li Lao Wang Xiao Feng Xiao He Lao Zhang
Age                            382
Rating                       44.92
dtype: object
------------------------------
0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64
------------------------------
       Name        Age     Rating
count    12  12.000000  12.000000
unique   12        NaN        NaN
top     Xiao Liang        NaN        NaN
freq      1        NaN        NaN
mean    NaN  31.833333   3.743333
std     NaN   9.232682   0.661628
min     NaN  23.000000   2.560000
25%     NaN  25.000000   3.230000
50%     NaN  29.500000   3.790000
75%     NaN  35.500000   4.132500
max     NaN  51.000000   4.800000 
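The functions the example above does not exercise, such as corr() and cumsum(), follow the same pattern; a small sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8], 'z': [4, 3, 2, 1]})
print(df['x'].corr(df['y']))  # y = 2x exactly, so the correlation is 1
print(df['x'].corr(df['z']))  # z decreases as x increases, so the correlation is -1
print(df.cumsum())            # default axis=0: running totals down each column
```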

4. Pandas plotting——plot

4.1. Introduction to plot

DataFrame.plot(x=None, y=None, kind='line', ax=None, subplots=False, 
                sharex=None, sharey=False, layout=None, figsize=None, 
                use_index=True, title=None, grid=None, legend=True, 
                style=None, logx=False, logy=False, loglog=False, 
                xticks=None, yticks=None, xlim=None, ylim=None, rot=None, 
                fontsize=None, colormap=None, position=0.5, table=False, yerr=None, 
                xerr=None, stacked=False, sort_columns=False, 
                secondary_y=False, mark_right=True, **kwds) 

The plot function is Pandas' built-in data-plotting entry point. It is implemented on top of the Matplotlib library and can be called directly on Series and DataFrame objects, providing simple but powerful data visualization. It can draw many chart types, such as line charts, bar charts, scatter plots, and box plots; the chart type is chosen with the kind parameter. When a DataFrame is plotted, one curve is drawn per column, and by default a legend is placed at a suitable location using the column names. Compared with plotting directly in Matplotlib this saves time, and DataFrame-shaped data is more regular, which makes vectorized computation convenient.

4.2. Main parameters of plot

| Parameter | Description |
| — | — |
| kind | Chart type. line: line chart; bar: bar chart (stacked when stacked=True); barh: horizontal bar chart; hist: histogram; box: box plot; area: area chart (with stacked=True, every column must have the same sign); kde: density plot; pie: pie chart; scatter: scatter plot; hexbin: hexbin plot |
| subplots | Draw each column in its own subplot, producing a grid of subplots; combine with layout to control the arrangement |
| grid | Show or hide grid lines |
| legend | Legend explaining each data series in the chart |
| logx | Use a logarithmic scale on the x-axis |
| logy | Use a logarithmic scale on the y-axis |
| loglog | Use logarithmic scales on both the x- and y-axes |
| xlim & ylim | Axis ranges; pass a two-element list or tuple of (min, max) |
| fontsize | Font size used in the chart |
| style / color / colors | Custom colors; different chart types may expect different parameters |
| secondary_y | Plot the given columns on a secondary y-axis on the right |

4.3. Line chart

4.3.1. Line chart

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Generate 1000 rows of standard-normal data (clustered around 0); the index is a date range and the columns are named A-D
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
print(df)
print('-' * 30)
# Take the cumulative sum so the curves drift instead of crowding around 0
df = df.cumsum()
print(df)
print('-' * 30)
# df.plot builds a chart object (an Axes); the row index becomes the x-axis and each column becomes a line
# If the row index contains dates, Pandas automatically calls gcf().autofmt_xdate() to format the x-axis
df.plot(kind='line',  # line chart
        grid=True,  # show grid lines
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],  # custom colors
        )

# Display existing chart objects
plt.show() 


4.3.2. Multiple subplots

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Generate 1000 rows of standard-normal data (clustered around 0); the index is a date range and the columns are named A-D
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
print(df)
print('-' * 30)
# Take the cumulative sum so the curves drift instead of crowding around 0
df = df.cumsum()
print(df)
print('-' * 30)
# df.plot builds a chart object (an Axes); the row index becomes the x-axis and each column becomes a line
# If the row index contains dates, Pandas automatically calls gcf().autofmt_xdate() to format the x-axis
df.plot(kind='line',  # line chart
        grid=True,  # show grid lines
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],  # custom colors
        subplots=True  # each column is drawn in its own subplot
        )

# Display existing chart objects
plt.show() 


4.3.3. Arranging multiple subplots arbitrarily

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Generate 1000 rows of standard-normal data (clustered around 0); the index is a date range and the columns are named A-D
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
print(df)
print('-' * 30)
# Take the cumulative sum so the curves drift instead of crowding around 0
df = df.cumsum()
print(df)
print('-' * 30)
# df.plot builds a chart object (an Axes); the row index becomes the x-axis and each column becomes a line
# If the row index contains dates, Pandas automatically calls gcf().autofmt_xdate() to format the x-axis
df.plot(kind='line',  # line chart
        grid=True,  # show grid lines
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],  # custom colors
        subplots=True,  # each column is drawn in its own subplot
        layout=(2, 2)  # subplot grid layout
        )

# Display existing chart objects
plt.show() 


4.3.4. Primary and secondary axis settings

import numpy as np
import pandas as pd
from matplotlib.axes import Axes  # matplotlib.axes._subplots is private and was removed in newer Matplotlib
from matplotlib import pyplot as plt

df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
# PyCharm cannot infer what df.plot returns, so annotating ax: Axes gives attribute/method completion
ax: Axes = df.plot(kind='line',
                   grid=True,
                   style=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],  # custom colors
                   secondary_y=['A', 'B'],  # put columns A and B on the secondary axis
                   )
# Use the methods of the chart object to modify the chart
ax.set_ylabel('CD scale')  # primary y-axis label
ax.legend(loc='upper left')  # primary-axis legend in the upper left
ax.right_ax.set_ylabel('AB scale')  # secondary y-axis label
ax.right_ax.legend(loc='upper right')  # secondary-axis legend in the upper right
# Display existing chart objects
plt.show() 


4.4. Bar chart

4.4.1. Displaying Series data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

# Use the SimSun (Songti) font so Chinese text renders in the figure
matplotlib.rcParams['font.family'] = ['SimSun']
# Create a 2x2 subplot layout; returns the figure and an array of subplot objects
fig, axes = plt.subplots(nrows=2, ncols=2)
# Create ten rows of data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
# axes.flat yields each subplot object; enumerate also provides the index while iterating
for i, ax in enumerate(axes.flat):
    df.iloc[i].plot(kind='bar',  # bar chart
                    ax=ax,  # draw into this subplot on the canvas
                    color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],  # custom colors
                    )
    ax.set_title('Group {} data'.format(i))  # set the subplot title
# show
plt.show() 

4.4.2. Displaying DataFrame data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='bar', color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'])
# show
plt.show() 


Tip: when plotting a Series, each label gets its own bar, with spacing between labels; when plotting a DataFrame, each row forms a group of bars (one per column), with spacing between groups.

4.4.3. Stacked bar chart

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='bar',
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
        stacked=True)
# show
plt.show() 


4.4.4. Horizontal bar chart

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='barh',
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
        stacked=True)
# show
plt.show() 


4.4.5. Histogram

Histograms are often used in the following situations:
① Showing the distribution of the data: a histogram visually displays distribution characteristics such as the center of the data, its spread, skewness, and kurtosis.
② Checking for outliers: a histogram helps identify abnormal values in the data, such as extreme values beyond the normal range.
③ Analyzing the characteristics of the data: a histogram shows how concentrated the data is and the shape of its distribution, providing a reference for subsequent analysis and modeling.
By observing a histogram you can quickly grasp the overall distribution of the data, which supports further analysis and decision-making.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='hist',
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'])
# show
plt.show() 


4.4.6. Box plot

Box plots usually contain the following parts:
① Box: the box spans the interquartile range of the data. Its upper edge is the third quartile (Q3, 75%), its lower edge is the first quartile (Q1, 25%), and the horizontal line inside the box marks the median (Q2, 50%). For example, for the 10 sorted values 65, 70, 72, 75, 78, 80, 82, 85, 88, 90: Q1 is at position 10 × 25% = 2.5, rounded up to the 3rd value, 72; Q2 is the median, (78 + 80) / 2 = 79; Q3 is at position 10 × 75% = 7.5, rounded up to the 8th value, 85. The box therefore covers the middle half of the data, from the 25% mark to the 75% mark.
② Whiskers: the whiskers represent the boundaries of the data, usually drawn at the maximum and minimum values above and below the box, although the exact rule varies with the definition used. Data points beyond the whiskers are treated as outliers.

Box plots are often used in the following situations:
① Observe the distribution of data: Box plots can help identify the median, quartiles and outliers of the data to understand the distribution characteristics of the data.
② Compare the distribution of data in different groups: Box plots can display the data distribution of multiple groups at the same time, making it easy to compare the differences and similarities between different groups.
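The quartiles of the data in the example above can be computed with Series.quantile(). Note that pandas interpolates linearly by default, so Q1 and Q3 can differ slightly from simple position-counting conventions:

```python
import pandas as pd

s = pd.Series([65, 70, 72, 75, 78, 80, 82, 85, 88, 90])
print(s.quantile(0.5))           # the median
print(s.quantile([0.25, 0.75]))  # Q1 and Q3, linearly interpolated
```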

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create ten rows of data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='box')
# show
plt.show() 


4.5. Area chart

4.5.1. Stacked

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with ten rows of random data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='area',
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'])
# show
plt.show() 


4.5.2. Non-stacked

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with ten rows of random data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='area',
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
        stacked=False)
# show
plt.show() 


4.6. Pie chart

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with four rows of random data
df = pd.DataFrame(np.random.rand(4, 4), columns=['P1', 'P2', 'P3', 'P4'], index=['a', 'b', 'c', 'd'])
df.plot(kind='pie',
        colors=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],  # Custom color
        subplots=True,
        layout=(2, 2),
        figsize=(8, 6))
# show
plt.show() 


4.7. Kernel density map

Kernel density estimation (KDE) plot: a non-parametric method for estimating a probability density function, which helps us understand how the data is distributed. It is a smoothed version of the histogram: a kernel function (usually a Gaussian kernel) estimates the probability density near each data point, and the contributions of all kernels are summed to produce a continuous curve representing the data's probability density function. KDE plots give a smoother picture of the distribution than histograms, making the distribution's characteristics easier to see at a glance.

Install dependencies (use this module to calculate kernel density): pip install scipy

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create a 1x2 subplot layout; returns the figure and an array of Axes objects
fig, axes = plt.subplots(nrows=1, ncols=2)

# Create 20 rows of random data
df = pd.DataFrame(np.random.rand(20, 4), columns=['a', 'b', 'c', 'd'])
df2 = df.copy()

# Kernel density plot
df.plot(kind='kde',
        ax=axes[0],
        color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
        figsize=(14, 6))

# The corresponding histogram
df2.plot(kind='hist',
         bins=20,
         ax=axes[1],
         color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
         figsize=(14, 6))

# show
plt.show() 


5. Read and write files

5.1. What is CSV?

CSV (Comma-Separated Values) is a commonly used text file format for storing tabular data. In a CSV file, fields are separated by commas or another delimiter (such as semicolons or tabs). Each line represents one data record of the table, and each field represents one attribute value in that record. CSV files usually use the .csv extension. It is a universal data exchange format supported by almost all data processing software and programming languages

5.2. Advantages of CSV files

① Easy to use: CSV files store data as plain text, are easy to create and edit, and are highly portable
② Cross-platform compatibility: CSV files can be exchanged between different operating systems and software without issue
③ Lightweight: compared with other storage formats (such as Excel files), CSV files are usually smaller and more storage-efficient

5.3. Disadvantages of CSV files

① No complex data structures: CSV files can only store flat tabular data; they do not support nested tables, multi-table relationships, and the like
② No metadata: CSV files cannot store metadata about the data, such as column data types or indexes; this information must be restored separately after reading
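The metadata limitation is easy to demonstrate: a round trip through CSV loses dtype information, which then has to be restored by hand. A minimal sketch, using an in-memory buffer in place of a real file:

```python
import io

import pandas as pd

df = pd.DataFrame({'when': pd.to_datetime(['2024-01-01', '2024-01-02']),
                   'x': [1, 2]})
# Write to CSV text and read it straight back
buf = io.StringIO(df.to_csv(index=False))
df2 = pd.read_csv(buf)

print(df.dtypes['when'])   # datetime64[ns]
print(df2.dtypes['when'])  # object: the datetime type was lost in the round trip
```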

5.4. Read CSV files – read_csv

① Commonly used parameters

pd.read_csv(filepath_or_buffer,  # file path or file-like object
            sep=',',  # delimiter between fields; comma by default
            delimiter=None,  # alias for sep; if both are given, delimiter takes precedence
            header='infer',  # row number to use as column names; 'infer' (the default) lets Pandas detect it automatically
            names=None,  # custom list of column names; with header=None, pass the column names here
            index_col=None,  # column to use as the row index
            usecols=None)  # subset of columns to read 

② Example

# Contents of person.csv
ID,Name,Age,City,Salary
1,Jack,28,Beijing,22000
2,Lida,32,Shanghai,19000
3,John,43,Shenzhen,12000
4,Helen,38,Hengshui,3500 
import pandas as pd
from pandas import DataFrame
import numpy as np

# Read directly; returns a DataFrame object
df1: DataFrame = pd.read_csv('./person.csv',  # file to read
                             header=0,  # treat the first row as the header (it is replaced by names below)
                             names=['i', 'n', 'a', 'c', 's'],  # custom column names
                             dtype={'s': np.float64}  # cast a column to another dtype
                             )
print(df1)
print('-' * 30)

# View the dtype of each column
print(df1.dtypes)
print('-' * 30)

# Use a column as the index
df2 = pd.read_csv('./person.csv',  # file name
                  index_col=0, )  # use the first column as the index; index_col=['ID'] also works
print(df2)
print('-' * 30)

# Start reading from a given row of the CSV file
df3 = pd.read_csv('./person.csv',  # file name
                  skiprows=2)  # skip the first two rows before reading
print(df3) 
# Output
   i      n   a         c        s
0  1   Jack  28   Beijing  22000.0
1  2   Lida  32  Shanghai  19000.0
2  3   John  43  Shenzhen  12000.0
3  4  Helen  38  Hengshui   3500.0
------------------------------
i      int64
n     object
a      int64
c     object
s    float64
dtype: object
------------------------------
     Name  Age      City  Salary
ID                              
1    Jack   28   Beijing   22000
2    Lida   32  Shanghai   19000
3    John   43  Shenzhen   12000
4   Helen   38  Hengshui    3500
------------------------------
   2   Lida  32  Shanghai  19000
0  3   John  43  Shenzhen  12000
1  4  Helen  38  Hengshui   3500 

5.5. Write CSV file——to_csv

① Commonly used parameters

df.to_csv(path_or_buf,  # path of the CSV file to write, or a file-like object
          sep=",",  # delimiter between fields
          na_rep="",  # string used to represent missing values
          columns=None,  # columns to write; a list of column names selects and orders them
          header=True,  # whether to write column names
          index=True,  # whether to write the row index
          mode="w",  # file open mode
          encoding="utf-8",  # encoding
          line_terminator="\n",  # line terminator (renamed to lineterminator in pandas 1.5)
          date_format=None,  # format string for datetime values
          chunksize=None,  # number of rows to write at a time
          ) 

② Example

import pandas as pd

data = {'Name': ['Smith', 'Parker'],
        'ID': [101, 102],
        'Language': ['Python', 'JavaScript']}
df = pd.DataFrame(data)
print(df)
print('-' * 30)
# Convert to a CSV string (no path given, so to_csv returns the string)
csv_data = df.to_csv(sep='|')
print(csv_data) 
# Output
     Name   ID    Language
0   Smith  101      Python
1  Parker  102  JavaScript
------------------------------
|Name|ID|Language
0|Smith|101|Python
1|Parker|102|JavaScript 
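The example above only changes the delimiter; index, header and na_rep are just as common in practice. A minimal sketch (the missing name is made-up data for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Smith', None], 'ID': [101, 102]})
# index=False drops the row index; na_rep fills the missing name
csv_data = df.to_csv(index=False, na_rep='N/A')
print(csv_data)
# Name,ID
# Smith,101
# N/A,102
```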

5.6. Write Excel file——to_excel

Install dependencies: pip install openpyxl

① Commonly used parameters

df.to_excel(excel_writer,  # path of the Excel file to write, or an ExcelWriter object
            sheet_name='Sheet1',  # name of the worksheet to write to
            na_rep='',  # string used to represent missing values
            float_format=None,  # format string for floating-point numbers
            columns=None,  # columns to write
            header=True,  # whether to write column names
            index=True,  # whether to write the row index
            index_label=None,  # name for the index column
            startrow=0,  # first row to write data to
            startcol=0,  # first column to write data to
            engine=None,  # engine to use
            merge_cells=True,  # whether to merge MultiIndex cells
            encoding=None,  # file encoding (removed in pandas 2.0)
            inf_rep='inf',  # string used to represent positive/negative infinity
            verbose=True,  # whether to print details while writing (removed in pandas 2.0)
            freeze_panes=None)  # panes to freeze 

② Example

import pandas as pd
from openpyxl.styles import Font, Alignment
from openpyxl.utils import get_column_letter

font = Font(name='Microsoft YaHei', size=11)  # Font style
alignment = Alignment(horizontal='center', vertical='center')  # Font layout

# Create the DataFrame data
df = pd.DataFrame({'name': ['Programming help', 'cLanguage Chinese Network', 'microgakuen', '92python'],
                   'rank': [1, 2, 3, 4],
                   'language': ['PHP', 'C', 'PHP', 'Python'],
                   'url': ['www.bianchneg.com', 'c.bianchneg.net', 'www.weixueyuan.com', 'www.92python.com']})
# Create an ExcelWriter object
writer = pd.ExcelWriter('./website.xlsx')
# Write the data
df.to_excel(writer)
# Modify the styles in the xlsx file
book = writer.book  # workbook
sheet = book.active  # worksheet
for row in sheet.iter_rows(min_row=1, min_col=1, max_row=sheet.max_row, max_col=sheet.max_column):
    for cell in row:
        cell.font = font  # set the font to Microsoft YaHei, size 11
        cell.alignment = alignment  # center the text
for i, col in enumerate(df.columns):
    max_length = max(df[col].astype(str).map(len).max(), len(col))  # maximum display length of each column
    sheet.column_dimensions[get_column_letter(i + 2)].width = max_length + 8  # set the column width (+2 skips the index column)

# Save and close (writer.save() was removed in pandas 2.0)
writer.close() 

5.7. Read Excel files – read_excel

① Commonly used parameters

pd.read_excel(io,  # Excel file path
              sheet_name=0,  # worksheet name or position
              header=0,  # row to use as column names; header=None if there is no header row; a list such as [0, 1] builds a MultiIndex from several rows
              names=None,  # custom column names
              index_col=None,  # column to use as the row index
              usecols=None,  # columns to read; None (the default) reads all columns
              squeeze=False,  # whether to return a Series instead of a single-column DataFrame (deprecated since pandas 1.4)
              dtype=None,  # data types for the columns
              engine=None,  # parsing engine to use
              converters=None,  # conversion functions for given columns
              true_values=None,  # strings to recognize as True
              false_values=None,  # strings to recognize as False
              skiprows=None,  # number of rows to skip at the start
              nrows=None,  # number of rows to read
              na_values=None,  # strings to recognize as missing values
              parse_dates=False,  # whether to parse date columns into datetime objects
              date_parser=None,  # function used to parse dates
              thousands=None,  # thousands separator character
              comment=None,  # comment character
              skipfooter=0,  # number of rows to skip at the end
              convert_float=True)  # whether to convert integral floats (e.g. 1.0) to int (deprecated since pandas 1.3) 

② Example

import pandas as pd
from pandas import DataFrame

# Read all data from the Excel file
df1: DataFrame = pd.read_excel('./website.xlsx',  # path
                               engine='openpyxl',  # parsing engine
                               index_col="name",  # use the 'name' column as the index
                               )
# The index column written by to_excel has no header, so it comes back as 'Unnamed: 0'; rename it
df1.columns = df1.columns.str.replace('Unnamed.*', 'col_label', regex=True)
print(df1)
print('-' * 30)

# Read part of the Excel file with a two-level index
df2: DataFrame = pd.read_excel('./website.xlsx',  # path
                               engine='openpyxl',  # parsing engine
                               index_col=[1, 2],  # use two of the selected columns as the index (0-based)
                               usecols=[1, 2, 3]  # read three columns of the sheet (0-based)
                               )
df2.columns = df2.columns.str.replace('Unnamed.*', 'col_label', regex=True)  # same renaming as above
print(df2) 
# Output
          col_label  rank language                 url
name                                                  
Programming help               0     1      PHP   www.bianchneg.com
cLanguage Chinese Network            1     2        C     c.bianchneg.net
microgakuen               2     3      PHP  www.weixueyuan.com
92python          3     4   Python    www.92python.com
------------------------------
                   name
rank language          
1    PHP            Programming help
2    C           cLanguage Chinese Network
3    PHP            microgakuen
4    Python    92python 

6. Data table interconnection, grouping aggregation, and data pivoting

6.1. Data table interconnection

Method   Description
merge    joins rows of different DataFrames on one or more keys (column values)
concat   concatenates multiple DataFrames along a specified axis
join     joins two DataFrames on their indexes (row labels)
append   appends one DataFrame to the end of another
import pandas as pd

# merge: join tables on one or more key columns
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, on='key', how='outer')  # 'outer' keeps the union of keys
print(merged_df)
print('-' * 30)

# concat: concatenate along a given axis (vertical/horizontal)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7']})
concatenated_df1 = pd.concat([df1, df2], axis=0)
print(concatenated_df1)
print('-' * 30)
concatenated_df2 = pd.concat([df1, df2], axis=1)
print(concatenated_df2)
print('-' * 30)

# join: join on the index (row labels)
df1 = pd.DataFrame({'value1': [1, 2, 3, 4]}, index=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame({'value2': [5, 6, 7, 8]}, index=['B', 'D', 'E', 'F'])
joined_df = df1.join(df2, how='outer')
print(joined_df)
print('-' * 30)

# append: vertical concatenation (DataFrame.append was removed in pandas 2.0, where pd.concat([df1, df2]) is the equivalent)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7']})
appended_df = df1.append(df2)
print(appended_df) 
 key  value_x  value_y
0   A      1.0      NaN
1   B      2.0      5.0
2   C      3.0      NaN
3   D      4.0      6.0
4   E      NaN      7.0
5   F      NaN      8.0
------------------------------
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3
0  A4  B4
1  A5  B5
2  A6  B6
3  A7  B7
------------------------------
    A   B   A   B
0  A0  B0  A4  B4
1  A1  B1  A5  B5
2  A2  B2  A6  B6
3  A3  B3  A7  B7
------------------------------
   value1  value2
A     1.0     NaN
B     2.0     5.0
C     3.0     NaN
D     4.0     6.0
E     NaN     7.0
F     NaN     8.0
------------------------------
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3
0  A4  B4
1  A5  B5
2  A6  B6
3  A7  B7 
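In the merge output above, the overlapping value column came back as value_x and value_y. Those are merge's default suffixes for clashing column names, and the suffixes parameter replaces them with clearer labels; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['B', 'C'], 'value': [5, 6]})
# Overlapping column names get _x/_y suffixes by default;
# suffixes= substitutes clearer labels
merged = pd.merge(df1, df2, on='key', how='outer', suffixes=('_left', '_right'))
print(merged.columns.tolist())  # ['key', 'value_left', 'value_right']
```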

6.2. Group aggregation

Method   Description
groupby  groups the data
agg      aggregates the grouped data
import pandas as pd

# Create an example DataFrame
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'one', 'two', 'two', 'one'],
                   'C': [1, 2, 3, 4, 5]})
# Group by column 'A'
grouped = df.groupby('A')
# Display grouped data
for name, group in grouped:
    print(name)
    print(group)
    print('-' * 30)
# Aggregate the grouped data: compute the mean and sum of each group
aggregated = grouped.agg({'C': ['mean', 'sum']})
print(aggregated) 
bar
     A    B  C
1  bar  one  2
3  bar  two  4
------------------------------
foo
     A    B  C
0  foo  one  1
2  foo  two  3
4  foo  one  5
------------------------------
       C    
    mean sum
A           
bar    3   6
foo    3   9 
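agg also supports named aggregation, which gives flat, readable column names instead of the two-level header shown above. A minimal sketch (the output names c_mean and c_sum are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [1, 2, 3, 4, 5]})
# Named aggregation: output_column=(input_column, function)
out = df.groupby('A').agg(c_mean=('C', 'mean'), c_sum=('C', 'sum'))
print(out)
# bar: mean 3.0, sum 6; foo: mean 3.0, sum 9
```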

6.3. Data pivoting

Pivoting: a data analysis technique for summarizing and analyzing large amounts of data in order to better understand the relationships and trends within it. Pivoting condenses complex data structures into an easy-to-read form, giving both an intuitive overview of and deeper insight into the data.

Pivot table: a concrete implementation of pivoting. By summarizing, grouping and aggregating the data, it presents the data along several dimensions and levels at once, revealing the intrinsic relationships and patterns in the data

df.pivot_table(
    index=None,  # column(s) to group by; becomes the row index of the result
    columns=None,  # column(s) to group by; becomes the column index of the result
    values=None,  # column(s) to aggregate, i.e. the data shown in the result
    aggfunc="mean"  # aggregation function(s) applied to the data, e.g. sum, mean, count; the default is the mean
) 
import pandas as pd

# Create an example DataFrame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                   'C': [1, 2, 3, 4, 5, 6],
                   'D': [7, 8, 9, 10, 11, 12]})
print(df)
print('-' * 30)
# Use pivot_table to create a pivot table
pivot = df.pivot_table(index='A',  # group by A; A becomes the pivot table's row index
                       columns='B',  # group by B; B becomes the pivot table's columns
                       values='C',  # aggregate column C
                       aggfunc='sum')  # sum C within each (A, B) group, e.g. (bar, one)
print(pivot) 
 A    B  C   D
0  foo  one  1   7
1  foo  one  2   8
2  foo  two  3   9
3  bar  two  4  10
4  bar  one  5  11
5  bar  one  6  12
------------------------------
B    one  two
A            
bar   11    4
foo    3    3 
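pivot_table can also append row and column totals via margins=True; a minimal sketch on the same shape of data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
                   'B': ['one', 'two', 'one', 'two'],
                   'C': [1, 2, 3, 4]})
# margins=True adds an 'All' row and column holding the overall aggregate
pivot = df.pivot_table(index='A', columns='B', values='C',
                       aggfunc='sum', margins=True)
print(pivot)
print(pivot.loc['All', 'All'])  # grand total: 10
```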