Pandas
-
- What is Pandas?
-
- 1.1. Pandas overview
- 1.2. Pandas features
- 1.3. Some differences between Pandas and NumPy
-
- Pandas data structure – Series
-
- 2.1. Series overview
- 2.2. Series creation
- 2.3. Common attributes of Series
- 2.4. Common methods of Series
-
- Pandas data structure – DataFrame
-
- 3.1. DataFrame overview
- 3.2. DataFrame creation
- 3.3. Column index operation DataFrame
- 3.4. Row index operation DataFrame
- 3.5. Conditional query, indexer query
- 3.6. Common methods of DataFrame
- 3.7. Common properties of DataFrame
- 3.8. Statistical functions
-
- Pandas drawing – plot
-
- 4.1. Introduction to plot
- 4.2. Main parameters of plot
- 4.3. Line chart
-
- 4.3.1. Line chart
- 4.3.2. Multiple subplots
- 4.3.3. Arrange multiple subplots arbitrarily
- 4.3.4. Primary and secondary axis settings
- 4.4. Bar chart
-
- 4.4.1. Display Series data
- 4.4.2. Display DataFrame data
- 4.4.3. Stacked bar chart
- 4.4.4. Horizontal bar chart
- 4.4.5. Histogram
- 4.4.6. Box plot
- 4.5. Area chart
-
- 4.5.1. Stacked type
- 4.5.2. Non-stacked type
- 4.6. Pie chart
- 4.7. Kernel density plot
-
- Read and write files
-
- 5.1. What is CSV?
- 5.2. Advantages of CSV files
- 5.3. Disadvantages of CSV files
- 5.4. Read CSV files – read_csv
- 5.5. Write CSV files – to_csv
- 5.6. Write Excel files – to_excel
- 5.7. Read Excel files – read_excel
-
- Data table joining, grouped aggregation, and data pivoting
-
- 6.1. Data table joining
- 6.2. Grouped aggregation
- 6.3. Data pivoting
-
- Statement
1. What is Pandas?
Install dependencies: pip install pandas
Import: import pandas as pd
1.1. Pandas overview
Pandas is a Python-based data analysis library that provides rich data structures and data manipulation tools for processing structured data. It is particularly well suited to tasks such as data cleaning, data transformation, data analysis, and data visualization.
1.2. Pandas features
① It provides a simple, efficient DataFrame object with default labels (labels can also be customized)
② Ability to quickly load data from sources in different formats (such as Excel, CSV, or SQL databases) and convert them into processable objects
③ Ability to group data by row and column labels, and perform aggregation and conversion operations on grouped objects
④ Can easily implement data normalization operations and missing value processing
⑤ Able to easily add, modify or delete data columns of DataFrame
⑥ Able to process data sets in different formats, such as matrix data, heterogeneous data tables, time series, etc.
⑦ Provides a variety of ways to process data sets, such as building subsets, slicing, filtering, grouping, and reordering, etc.
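For example, features ③ and ⑤ above (grouping with aggregation, and adding columns) can be sketched in a few lines; the table and column names below are invented for illustration:

```python
import pandas as pd

# A small, invented sales table
df = pd.DataFrame({
    'city': ['NY', 'NY', 'LA', 'LA'],
    'sales': [100, 150, 80, 120],
})

# Feature ⑤: add a derived column in one assignment
df['sales_k'] = df['sales'] / 1000

# Feature ③: group rows by a label column, then aggregate each group
totals = df.groupby('city')['sales'].sum()
print(totals)  # LA 200, NY 250
```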
1.3. Some differences between Pandas and Numpy
① Data structure
NumPy mainly provides multi-dimensional array objects ndarray, which is a fixed-size array and all elements are of the same type.
Pandas provides two main data structures: Series and DataFrame. Series is a one-dimensional labeled array, similar to NumPy's one-dimensional array; while DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or database table
②Usage:
NumPy is mainly used in the fields of numerical computing and scientific computing, providing high-performance numerical computing functions, such as array operations, linear algebra operations, Fourier transform, etc.
Pandas is mainly used in the fields of data analysis and data processing. It provides a wealth of data operation tools, including data cleaning, data conversion, data grouping, data aggregation, data visualization and other functions.
③ Data processing capabilities:
NumPy is more suitable for processing numerical data and provides a wealth of numerical operation functions and methods.
Pandas is more suitable for processing structured data, such as tabular data, time series data, etc., and provides flexible data operation and analysis tools.
Comparison terms | Pandas | NumPy |
---|---|---|
Adaptability | Pandas is mainly used to process tabular data | NumPy is mainly used to process numerical data |
Tools | Pandas provides Series and DataFrame data structures | NumPy builds ndarray arrays to hold data |
Performance | Pandas is more advantageous for processing data with more than 500,000 rows | NumPy performs better for data with 500,000 rows or fewer |
Memory utilization | Pandas consumes a lot of memory compared to NumPy | NumPy consumes less memory |
Object | Pandas provides a DataFrame 2D data table object | NumPy provides a multidimensional array ndarray object |
Method to convert Pandas DataFrame object into ndarray: to_numpy()
import pandas as pd
# Construct a DataFrame object
df = pd.DataFrame({"P": [2, 3], "Q": [4.0, 5.8]})
df['R'] = pd.date_range('2020-12-23', periods=2) # add another column
print(df)
print('-' * 30)
# Convert to a NumPy array
n = df.to_numpy()
print(n)
2. Pandas data structure – Series
2.1. Series overview
A Series in Pandas is a one-dimensional labeled array, similar to a dictionary in Python or a one-dimensional array in NumPy. It consists of a set of data and a set of indexes (tags) associated with it. It can store any type of data, and the index can be of any type.
2.2. Series creation
① Create using list
import pandas as pd
list1 = [1, 2, 3, 4]
s1 = pd.Series(list1)
print(s1)
0 1
1 2
2 3
3 4
dtype: int64
② Create using ndarray object
import pandas as pd
import numpy as np
arr = np.array([1, 2, 3, 4])
s1 = pd.Series(arr, index=['A', 'B', 'C', 'D']) # index can be used to specify the labels
print(s1)
A 1
B 2
C 3
D 4
dtype: int32
③ Create using dictionary
import pandas as pd
s1 = pd.Series({'name': 'hyh', 'age': 18, 'gender': 'boy'})
print(s1)
name hyh
age 18
gender boy
dtype: object
④ Create using a scalar
import pandas as pd
s1 = pd.Series(4, index=['a', 'b', 'c', 'd'])
print(s1)
a 4
b 4
c 4
d 4
dtype: int64
2.3. Common attributes of Series
Properties | Description |
---|---|
values | Return data in the form of ndarray |
index | Index is an Index object |
axes | The index of each axis, which is an array of Index objects |
size | length |
ndim | Dimensions of data |
dtype | data type |
import pandas as pd
l1 = [1, 2, 3, 4]
s = pd.Series(l1, index=['a', 'b', 'c', 'd'])
print("Series:\n", s)
print('-' * 30)
# access data
print(s.values, type(s.values))
print('-' * 30)
# access index
print(s.index, list(s.index))
print('-' * 30)
# Access the index of each axis
print(s.axes, s.axes[0] is s.index)
print('-' * 30)
# access size
print(s.size)
print('-' * 30)
# access dimensions
print(s.ndim)
print('-' * 30)
# access data type
print(s.dtype)
Series:
a 1
b 2
c 3
d 4
dtype: int64
------------------------------
[1 2 3 4] <class 'numpy.ndarray'>
------------------------------
Index(['a', 'b', 'c', 'd'], dtype='object') ['a', 'b', 'c', 'd']
------------------------------
[Index(['a', 'b', 'c', 'd'], dtype='object')] True
------------------------------
4
------------------------------
1
------------------------------
int64
2.4. Common methods of Series
Note: ser below refers to the instance object of Series
Method | Description |
---|---|
ser.head(num) | The first num rows of data |
ser.tail(num) | The last num rows of data |
pd.isnull(ser) / pd.notnull(ser) | Detect missing values; returns a boolean Series |
import pandas as pd
import numpy as np
arr = np.random.randint(1, 10, 5)
s = pd.Series(arr)
print(s)
print('-' * 30)
print(s.head(2))
print('-' * 30)
print(s.tail(2))
print('-' * 30)
print(pd.isnull(s))
print('-' * 30)
0 8
1 1
2 4
3 2
4 4
dtype: int32
------------------------------
0 8
1 1
dtype: int32
------------------------------
3 2
4 4
dtype: int32
------------------------------
0 False
1 False
2 False
3 False
4 False
dtype: bool
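The Series above happens to contain no missing values, so isnull returns all False. A small sketch with an actual NaN shows both detection methods side by side:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(pd.isnull(s))   # True only at position 1, where the NaN sits
print(pd.notnull(s))  # the element-wise complement
```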
3. Pandas data structure – DataFrame
3.1. DataFrame overview
DataFrame is one of the most important data structures in Pandas. It is a two-dimensional tagged data structure, similar to a spreadsheet or a table in a database. DataFrame consists of rows and columns. Each column can be of different data types (such as integers, floating point numbers, strings, etc.), and each row and column has a corresponding label index.
3.2. DataFrame creation
① Create using list
import pandas as pd
list1 = [1, 2]
df = pd.DataFrame(list1)
print(df)
print('-' * 30)
list2 = [('Alex', 11), ('Bob', 12), ('Clarke', 13)]
df = pd.DataFrame(list2, columns=['name', 'age'], dtype=float)
print(df)
0
0 1
1 2
------------------------------
name age
0 Alex 11.0
1 Bob 12.0
2 Clarke 13.0
② Create using a dictionary of lists
import pandas as pd
data1 = {'name': ['Tom', 'Jack', 'Steve', 'Hyh'], 'age': [28, 48, 61, 15]}
df = pd.DataFrame(data1)
print(df)
data2 = {'name': ['Pyy', 'Zayn', 'Yh'], 'age': [32, 51, 23]}
df = pd.DataFrame(data2, index=['rank1', 'rank2', 'rank3'])
print(df)
name age
0 Tom 28
1 Jack 48
2 Steve 61
3 Hyh 15
name age
rank1 Pyy 32
rank2 Zayn 51
rank3 Yh 23
③ Create using a list of dictionaries
import pandas as pd
data1 = [{'name': 'Pyy', 'age': 22}, {'name': 'Zayn', 'age': 31, 'gender': 'boy'}]
df = pd.DataFrame(data1)
print(df)
print('-' * 30)
data1 = [{'name': 'Pyy', 'age': 22}, {'name': 'Zayn', 'age': 31, 'gender': 'boy'}]
df = pd.DataFrame(data1, index=['first', 'second'], columns=['age', 'name'])
print(df)
name age gender
0 Pyy 22 NaN
1 Zayn 31 boy
------------------------------
age name
first 22 Pyy
second 31 Zayn
④ Create using a dictionary of Series
import pandas as pd
ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([2, 3, 4, 5], index=['b', 'c', 'd', 'e'])
arr = {'one': ser1, 'two': ser2}
df = pd.DataFrame(arr)
print(df)
one two
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
e NaN 5.0
3.3. Column index operation DataFrame
import pandas as pd
ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([2, 3, 4, 5], index=['b', 'c', 'd', 'e'])
arr = {'one': ser1, 'two': ser2}
df = pd.DataFrame(arr)
print(df)
print('-' * 30)
# Access a single column; returns a Series object
print(df['one'])
print('-' * 30)
# Access multiple columns; returns a DataFrame object
print(df[['one', 'two']])
print('-' * 30)
# add a column
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
df.insert(loc=1, column='four', value=[40, 50, 60, 70, 80]) # the inserted data length must match the existing index length
print(df)
print('-' * 30)
# Delete a column
df.pop('four')
print(df)
one two
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
e NaN 5.0
------------------------------
a 1.0
b 2.0
c 3.0
d NaN
e NaN
Name: one, dtype: float64
------------------------------
one two
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
e NaN 5.0
------------------------------
one four two three
a 1.0 40 NaN 10.0
b 2.0 50 2.0 20.0
c 3.0 60 3.0 30.0
d NaN 70 4.0 NaN
e NaN 80 5.0 NaN
------------------------------
one two three
a 1.0 NaN 10.0
b 2.0 2.0 20.0
c 3.0 3.0 30.0
d NaN 4.0 NaN
e NaN 5.0 NaN
3.4. Row index operation DataFrame
import pandas as pd
ser1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ser2 = pd.Series([2, 3, 4, 5], index=['b', 'c', 'd', 'e'])
arr = {'one': ser1, 'two': ser2}
df = pd.DataFrame(arr)
print(df)
print('-' * 30)
# Access rows using slices
print(df[1:4])
print('-' * 30)
# Use loc to access a row by its label
print(df.loc['b'], '\n', dict(df.loc['b']))
print('-' * 30)
# Use iloc to access a row by its integer position
print(df.iloc[3], '\n', dict(df.iloc[3]))
print('-' * 30)
# Append rows (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
newrow = pd.DataFrame([[10, 20], [30, 40]], columns=['one', 'two'], index=['f', 'g'])
df = pd.concat([df, newrow])
print(df)
print('-' * 30)
# Delete a row
df = df.drop('b')
print(df)
one two
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
e NaN 5.0
------------------------------
one two
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
------------------------------
one 2.0
two 2.0
Name: b, dtype: float64
{'one': 2.0, 'two': 2.0}
------------------------------
one NaN
two 4.0
Name: d, dtype: float64
{'one': nan, 'two': 4.0}
------------------------------
one two
a 1.0 NaN
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
e NaN 5.0
f 10.0 20.0
g 30.0 40.0
------------------------------
one two
a 1.0 NaN
c 3.0 3.0
d NaN 4.0
e NaN 5.0
f 10.0 20.0
g 30.0 40.0
3.5. Conditional query, indexer query
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
columns=['c1', 'c2', 'c3', 'c4'],
index=['i1', 'i2', 'i3', 'i4'])
print(df)
print('-' * 30)
# Conditional indexing
print(df[df['c1'] >= 5])
print('-' * 30)
# Multiple conditions: use & / | and parenthesize each condition
print(df[(df['c1'] >= 5) & (df['c3'] <= 5)])
print('-' * 30)
# Indexers are loc and iloc: pass the row selection first, then the column selection
print(df.loc[['i1', 'i3'], ['c2', 'c4']])
print('-' * 30)
print(df.loc[df['c1'] >= 5, ['c2', 'c4']])
print('-' * 30)
print(df.iloc[0:2, 1:3])
# Output:
c1 c2 c3 c4
i1 1 5 8 9
i2 1 2 3 3
i3 4 9 5 7
i4 7 9 1 4
------------------------------
c1 c2 c3 c4
i4 7 9 1 4
------------------------------
c1 c2 c3 c4
i4 7 9 1 4
------------------------------
c2 c4
i1 5 9
i3 9 7
------------------------------
c2 c4
i4 9 4
------------------------------
c2 c3
i1 5 8
i2 2 3
3.6. Common methods of DataFrame
| Method | Description |
| --- | --- |
| rename(columns, index, inplace) | Rename rows/columns. columns: new column names; index: new row names; inplace: modify the original data (default False) |
| insert(loc, column, value, allow_duplicates) | Insert a column. loc: insertion position; column: column name; value: values; allow_duplicates: allow duplicate column names (default False) |
| drop(labels, axis, index, inplace) | Delete rows/columns. labels: row/column names; index: alternative to labels (use one or the other); axis: 0 deletes rows (default), 1 deletes columns; inplace: delete on the original data (default False) |
| head(n), tail(n) | First / last n rows |
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
columns=['c1', 'c2', 'c3', 'c4'],
index=['i1', 'i2', 'i3', 'i4'])
# Original DataFrame
print(df)
print('-' * 30)
# Modify row and column names
df.rename(index={'i1': 0, 'i2': 1}, inplace=True) # Modify row name
df.rename(columns={'c1': '3', 'c2': '4'}, inplace=True) # Modify column name
print(df)
print('-' * 30)
# insert column
df.insert(loc=2, column='5', value=[100, 200, 300, 400])
print(df)
print('-' * 30)
# Delete rows and columns
df.drop(axis=0, labels='i3', inplace=True) # Delete row
df.drop(axis=1, labels='5', inplace=True) # Delete column
print(df)
print('-' * 30)
# Get the first and last rows
print(df.head(1))
print('-' * 30)
print(df.tail(2))
# Output:
c1 c2 c3 c4
i1 7 9 5 3
i2 2 8 2 6
i3 7 9 7 7
i4 5 7 5 9
------------------------------
3 4 c3 c4
0 7 9 5 3
1 2 8 2 6
i3 7 9 7 7
i4 5 7 5 9
------------------------------
3 4 5 c3 c4
0 7 9 100 5 3
1 2 8 200 2 6
i3 7 9 300 7 7
i4 5 7 400 5 9
------------------------------
3 4 c3 c4
0 7 9 5 3
1 2 8 2 6
i4 5 7 5 9
------------------------------
3 4 c3 c4
0 7 9 5 3
------------------------------
3 4 c3 c4
1 2 8 2 6
i4 5 7 5 9
3.7. Common properties of DataFrame
| Properties | Description |
| --- | --- |
| head(num), tail(num) | The first/last num rows of data |
| values | The data as a two-dimensional ndarray |
| size | Number of elements (not the number of rows; a row may contain multiple elements) |
| axes | A list containing the row index and column index |
| ndim | Number of axes |
| shape | A tuple representing the dimensions of the DataFrame |
| dtypes | The data type of each column |
| T | Transpose |
| shift(periods, freq, axis, fill_value) | Shift rows/columns by the given stride. periods: int, shift amount (default 1; positive shifts down/right); freq: date offset for time series (default None); axis: 0 shifts up/down (default), 1 shifts left/right; fill_value: value used to fill the resulting holes |
import pandas as pd
df = pd.DataFrame({'a_data': [28, 71, 37, 16, 58], 'b_data': [35, 25, 17, 39, 54], 'c_data': [36, 51, 34, 56, 71]})
print(df)
print('-' * 30)
df = df.shift(periods=1).shift(periods=1, axis=1)
print(df)
print('-' * 30)
df = df.shift(periods=1, axis=1, fill_value='a')
print(df)
a_data b_data c_data
0 28 35 36
1 71 25 51
2 37 17 34
3 16 39 56
4 58 54 71
------------------------------
a_data b_data c_data
0 NaN NaN NaN
1 NaN 28.0 35.0
2 NaN 71.0 25.0
3 NaN 37.0 17.0
4 NaN 16.0 39.0
------------------------------
a_data b_data c_data
0 a NaN NaN
1 a NaN 28
2 a NaN 71
3 a NaN 37
4 a NaN 16
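The shift example only exercises one entry of the table above; a short sketch of several of the other attributes, using a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

print(df.shape)   # (3, 2): 3 rows, 2 columns
print(df.size)    # 6: total number of elements
print(df.ndim)    # 2: number of axes
print(df.dtypes)  # per-column types: a is int64, b is float64
print(df.T)       # transpose: rows and columns swap
```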
3.8. Statistical functions
Method | Description |
---|---|
count() | Count the number of non-null values |
sum() | Sum |
mean() | mean |
median() | median number |
mode() | mode |
std() | standard deviation |
min() | minimum value |
max() | maximum value |
abs() | absolute value |
prod() | Product of all values |
cumsum() | Cumulative sum (axis=0, default: down each column; axis=1: across each row) |
cumprod() | Cumulative product (axis=0: down each column; axis=1: across each row) |
corr() | Calculate the correlation coefficient between sequences or variables, ranging from -1 to 1. The closer to 1, the stronger the correlation |
describe(include='all') | Statistics summary |
import pandas as pd
# Create a dictionary of Series
d = {'Name': pd.Series(['Xiao Ming', 'Small light', 'Xiaohong', 'Xiaohua', 'Lao Zhao', 'Xiao Cao', 'Xiao Chen',
'Lao Li', 'Lao Wang', 'Xiao Feng', 'small what', 'Lao Zhang']),
'Age': pd.Series([25, 26, 25, 23, 30, 29, 23, 34, 40, 30, 51, 46]),
'Rating': pd.Series([4.23, 3.24, 3.98, 2.56, 3.20, 4.6, 3.8, 3.78, 2.98, 4.80, 4.10, 3.65])
}
df = pd.DataFrame(d)
print(df)
print('-' * 30)
print(df.sum())
print('-' * 30)
print(df.sum(axis=1))
print('-' * 30)
print(df.describe(include='all'))
Name Age Rating
0 Xiao Ming 25 4.23
1 Small light 26 3.24
2 Xiaohong 25 3.98
3 Xiaohua 23 2.56
4 Lao Zhao 30 3.20
5 Xiao Cao 29 4.60
6 Xiao Chen 23 3.80
7 Lao Li 34 3.78
8 Lao Wang 40 2.98
9 Xiao Feng 30 4.80
10 small what 51 4.10
11 Lao Zhang 46 3.65
------------------------------
Name Xiao Ming Xiao Liang Xiao Hong Xiao Hua Lao Zhao Xiao Cao Xiao Chen Lao Li Lao Wang Xiao Feng Xiao He Lao Zhang
Age 382
Rating 44.92
dtype: object
------------------------------
0 29.23
1 29.24
2 28.98
3 25.56
4 33.20
5 33.60
6 26.80
7 37.78
8 42.98
9 34.80
10 55.10
11 49.65
dtype: float64
------------------------------
Name Age Rating
count 12 12.000000 12.000000
unique 12 NaN NaN
top Small light NaN NaN
freq 1 NaN NaN
mean NaN 31.833333 3.743333
std NaN 9.232682 0.661628
min NaN 23.000000 2.560000
25% NaN 25.000000 3.230000
50% NaN 29.500000 3.790000
75% NaN 35.500000 4.132500
max NaN 51.000000 4.800000
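The example above only exercises sum and describe; a quick sketch of a few more functions from the table. The tiny DataFrame here is made up, and its y column is an exact multiple of x, so corr reports a perfect correlation:

```python
import pandas as pd

# y is exactly 2 * x, so the correlation coefficient between the columns is 1.0
df = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [2, 4, 6, 8]})

print(df.mean())    # column means: x 2.5, y 5.0
print(df.median())  # column medians: x 2.5, y 5.0
print(df.cumsum())  # running totals down each column
print(df.corr())    # pairwise correlation matrix; off-diagonal entries are 1.0 here
```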
4. Pandas drawing – plot
4.1. Introduction to plot
DataFrame.plot(x=None, y=None, kind='line', ax=None, subplots=False,
sharex=None, sharey=False, layout=None, figsize=None,
use_index=True, title=None, grid=None, legend=True,
style=None, logx=False, logy=False, loglog=False,
xticks=None, yticks=None, xlim=None, ylim=None, rot=None,
fontsize=None, colormap=None, position=0.5, table=False, yerr=None,
xerr=None, stacked=True/False, sort_columns=False,
secondary_y=False, mark_right=True, **kwds)
The plot function is Pandas' built-in entry point for plotting data. It is implemented on top of the Matplotlib library and can be called directly on Series and DataFrame objects, providing simple yet powerful data visualization. It can draw many chart types (line, bar, scatter, box, and more), selected via the kind parameter. When plotting a DataFrame, one curve is drawn per column, and by default a legend is placed at a suitable location using the column names. Compared with plotting directly in Matplotlib this saves time, and data in DataFrame format is more regular and convenient for vectorized computation.
4.2. Main parameters of plot
| Parameters | Description |
| --- | --- |
| kind | Chart type. line: line chart; bar: bar chart (stacked when stacked=True); barh: horizontal bar chart; hist: histogram; box: box plot; area: area chart (when stacked=True, all columns must share the same sign); kde: density plot; pie: pie chart; scatter: scatter plot; hexbin: hexbin plot |
| subplots | Draw each column in its own subplot; combine with layout to arrange the subplot grid |
| grid | Show or hide grid lines |
| legend | Legend explaining each data series in the chart |
| logx | Use a logarithmic scale on the x-axis |
| logy | Use a logarithmic scale on the y-axis |
| loglog | Use logarithmic scales on both the x- and y-axes |
| xlim & ylim | Axis ranges; pass a two-element list or tuple of (minimum, maximum) |
| fontsize | Font size in the chart |
| style / color / colors | Custom colors; different chart types may expect different parameters |
| secondary_y | Put the listed columns on a secondary y-axis on the right |
4.3. Line chart
4.3.1. Line chart
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Generate 1000 rows of standard-normal data (clustered around 0); the index is a date range and the columns are named A-D
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
print(df)
print('-' * 30)
# Take the cumulative sum so the curves drift instead of crowding around 0
df = df.cumsum()
print(df)
print('-' * 30)
# df.plot builds a chart object (an Axes); the row index becomes the x-axis and each column becomes one series
# If the row index contains dates, Pandas formats and rotates the x-axis labels automatically
df.plot(kind='line', # line chart
grid=True, # show grid lines
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'], # custom colors
)
# Display existing chart objects
plt.show()
4.3.2. Multiple subplots
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Generate 1000 rows of standard-normal data (clustered around 0); the index is a date range and the columns are named A-D
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
print(df)
print('-' * 30)
# Take the cumulative sum so the curves drift instead of crowding around 0
df = df.cumsum()
print(df)
print('-' * 30)
# df.plot builds a chart object (an Axes); the row index becomes the x-axis and each column becomes one series
# If the row index contains dates, Pandas formats and rotates the x-axis labels automatically
df.plot(kind='line', # line chart
grid=True, # show grid lines
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'], # custom colors
subplots=True # each column drawn in its own subplot
)
# Display existing chart objects
plt.show()
4.3.3. Arrange multiple subplots arbitrarily
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
# Generate 1000 rows of standard-normal data (clustered around 0); the index is a date range and the columns are named A-D
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
print(df)
print('-' * 30)
# Take the cumulative sum so the curves drift instead of crowding around 0
df = df.cumsum()
print(df)
print('-' * 30)
# df.plot builds a chart object (an Axes); the row index becomes the x-axis and each column becomes one series
# If the row index contains dates, Pandas formats and rotates the x-axis labels automatically
df.plot(kind='line', # line chart
grid=True, # show grid lines
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'], # custom colors
subplots=True, # each column drawn in its own subplot
layout=(2, 2) # subplot grid layout
)
# Display existing chart objects
plt.show()
4.3.4. Primary and secondary axis settings
import numpy as np
import pandas as pd
from matplotlib.axes import Axes
from matplotlib import pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 4), index=pd.date_range('1/1/2000', periods=1000), columns=list('ABCD'))
df = df.cumsum()
# PyCharm cannot infer what df.plot returns, so annotate it as ax: Axes; the IDE will then offer completion for its attributes and methods
ax: Axes = df.plot(kind='line',
grid=True,
style=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'], # Custom color
secondary_y=['A', 'B'], # put columns A and B on the secondary axis
)
# Use the methods of the chart object to modify the chart
ax.set_ylabel('CD scale') # primary-axis label
ax.legend(loc='upper left') # primary-axis legend in the upper left
ax.right_ax.set_ylabel('AB scale') # secondary-axis label
ax.right_ax.legend(loc='upper right') # secondary-axis legend in the upper right
# Display existing chart objects
plt.show()
4.4. Bar chart
4.4.1. Display Series data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
# Set the font family to SimSun (Songti) so non-ASCII text in the figure renders correctly
matplotlib.rcParams['font.family'] = ['SimSun']
# Create a 2x2 subplot layout; returns the figure and an array of Axes objects
fig, axes = plt.subplots(nrows=2, ncols=2)
# Create ten sets of data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
# axes.flat iterates over the subplot Axes; enumerate also yields the index
for i, ax in enumerate(axes.flat):
df.iloc[i].plot(kind='bar', # bar chart
ax=ax, # Assign the result to the subgraph on the canvas
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'], # custom colors
)
ax.set_title('No. {} group data'.format(i)) # Set subgraph title
# show
plt.show()
4.4.2. Display DataFrame data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='bar', color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'])
# show
plt.show()
Tip: when plotting a Series, each label gets its own bar, with a gap between labels. When plotting a DataFrame, each row index position holds one group of bars (one bar per column), with a gap between groups.
4.4.3. Stacked bar chart
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='bar',
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
stacked=True)
# show
plt.show()
4.4.4. Horizontal bar chart
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='barh',
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
stacked=True)
# show
plt.show()
4.4.5. Histogram
Histograms are often used in the following situations:
① Display the distribution of data: Histograms can visually display the distribution characteristics of data, such as the center position of the data, degree of dispersion, skewness and kurtosis, etc.
② Check the data for outliers: Histograms can help identify outliers in the data, such as whether there are extreme values beyond the normal range.
③ Analyze the characteristics of the data: Histograms can help understand the characteristics of the data, such as the degree of data concentration and distribution shape, etc., thereby providing a reference for subsequent data analysis and modeling.
By observing the histogram, you can quickly understand the overall distribution of the data, which is helpful for further data analysis and decision-making.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create four rows of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='hist',
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'])
# show
plt.show()
4.4.6. Box plot
Boxplots usually contain the following parts:
① Box: the box spans the interquartile range of the data. Its upper edge is the third quartile (Q3, 75%), its lower edge is the first quartile (Q1, 25%), and the horizontal line inside the box is the median (Q2, 50%). For example, for the ten values 65, 70, 72, 75, 78, 80, 82, 85, 88, 90: Q1 = 10 × 25% = 2.5 => 3rd value: 72; Q2 = median = (75 + 78) / 2 = 76.5; Q3 = 10 × 75% = 7.5 => 8th value: 85. The box therefore covers the middle half of the data, from the 25th to the 75th percentile.
② Whiskers: the whiskers mark the upper and lower limits of the data, usually extending to the maximum and minimum values above and below the box, although the exact definition can vary. Data points beyond the whiskers are called outliers.
Box plots are often used in the following situations:
① Observe the distribution of data: Box plots can help identify the median, quartiles and outliers of the data to understand the distribution characteristics of the data.
② Compare the distribution of data in different groups: Box plots can display the data distribution of multiple groups at the same time, making it easy to compare the differences and similarities between different groups.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create ten pieces of data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='box')
# show
plt.show()
4.5. Area chart
4.5.1. Stacked type
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create ten pieces of data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='area',
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'])
# show
plt.show()
4.5.2. Non-stacked type
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create ten pieces of data
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
print(df)
df.plot(kind='area',
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
stacked=False)
# show
plt.show()
4.6. Pie chart
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create four pieces of data
df = pd.DataFrame(np.random.rand(4, 4), columns=['P1', 'P2', 'P3', 'P4'], index=['a', 'b', 'c', 'd'])
df.plot(kind='pie',
colors=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'], # Custom color
subplots=True,
layout=(2, 2),
figsize=(8, 6))
# show
plt.show()
4.7. Kernel density plot
Kernel density estimation (KDE) plot: a non-parametric method for estimating a probability density function, useful for understanding how data is distributed. It smooths the histogram by placing a kernel function (usually a Gaussian) at each data point to estimate the local probability density, then summing all the kernels into one continuous curve representing the data's probability density function. KDE plots give a smoother distribution curve than histograms, making the shape of the data easier to see.
Install dependencies (use this module to calculate kernel density): pip install scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
# Create a 1x2 subplot layout; returns the figure and an array of Axes objects
fig, axes = plt.subplots(nrows=1, ncols=2)
# Create 20 rows of data
df = pd.DataFrame(np.random.rand(20, 4), columns=['a', 'b', 'c', 'd'])
df2 = df.copy()
# Kernel density plot
df.plot(kind='kde',
ax=axes[0],
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
figsize=(14, 6))
# The corresponding histogram
df2.plot(kind='hist',
bins=20,
ax=axes[1],
color=['#79CDCD', '#EEB4B4', '#6CA6CD', '#FFA54F'],
figsize=(14, 6))
# show
plt.show()
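Under the hood, pandas computes kind='kde' with scipy's gaussian_kde (which is why scipy must be installed). A minimal sketch of the estimate itself, with made-up sample data for illustration:

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative sample: 200 draws from a standard normal distribution
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=200)

# One Gaussian kernel is placed on each data point; gaussian_kde
# superimposes them into a single smooth density estimate
kde = gaussian_kde(data)
xs = np.linspace(-4, 4, 400)  # evaluation grid
density = kde(xs)             # the continuous curve plotted by kind='kde'

# The curve behaves like a probability density: it is non-negative
# and its integral over the grid is close to 1
area = float(np.sum(density) * (xs[1] - xs[0]))
print(round(area, 2))
```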
5. Read and write files
5.1. What is CSV?
CSV (Comma-Separated Values) is a commonly used text file format for storing tabular data. In a CSV file, fields are separated by commas or another delimiter (such as semicolons or tabs); each row represents one data record, and each field represents one attribute of that record. CSV files usually use the .csv extension. CSV is a universal data exchange format supported by almost all data processing software and programming languages.
5.2. Advantages of CSV files
① Easy to use: CSV files store data as plain text, are easy to create and edit, and are highly versatile.
② Cross-platform compatibility: CSV files can be used to exchange data between different operating systems and software.
③ Lightweight: compared with other data storage formats (such as Excel files), CSV files are usually lighter and more storage-efficient.
5.3. Disadvantages of CSV files
① Does not support complex data structures: CSV files can only store flat tabular data; they do not support complex structures such as nested tables or multi-table associations.
② Does not support metadata: CSV files cannot store metadata such as data types or indexes, which must be handled separately.
5.4. Read CSV files – read_csv
① Commonly used parameters
pd.read_csv(filepath_or_buffer,  # File path or file-like object
            sep=',',             # Field delimiter; defaults to a comma
            delimiter=None,      # Alias for sep; if both are given, delimiter takes precedence
            header='infer',      # Row number to use as the column names; the default 'infer' lets pandas detect it
            names=None,          # Custom list of column names; used with header=None (or header=0 to replace the existing names)
            index_col=None,      # Column to use as the row index
            usecols=None)        # Subset of columns to read
② Example
# Contents of person.csv
ID,Name,Age,City,Salary
1,Jack,28,Beijing,22000
2,Lida,32,Shanghai,19000
3,John,43,Shenzhen,12000
4,Helen,38,Hengshui,3500
import pandas as pd
from pandas import DataFrame
import numpy as np
# Reading directly returns a DataFrame object
df1: DataFrame = pd.read_csv('./person.csv',  # Read the file
                             header=0,  # Row 0 holds the original column names...
                             names=['i', 'n', 'a', 'c', 's'],  # ...which these custom names replace
                             dtype={'s': np.float64}  # Re-type a field
                             )
print(df1)
print('-' * 30)
# View the type of each column
print(df1.dtypes)
print('-' * 30)
# Select a column to use as the index
df2 = pd.read_csv('./person.csv',  # File name
                  index_col=0)  # Use the first column as the index; index_col=['ID'] also works
print(df2)
print('-' * 30)
# Start reading the CSV file from a specified line
df3 = pd.read_csv('./person.csv',  # File name
                  skiprows=2)  # Skip the first two lines before reading
print(df3)
# Output
i n a c s
0 1 Jack 28 Beijing 22000.0
1 2 Lida 32 Shanghai 19000.0
2 3 John 43 Shenzhen 12000.0
3 4 Helen 38 Hengshui 3500.0
------------------------------
i int64
n object
a int64
c object
s float64
dtype: object
------------------------------
Name Age City Salary
ID
1 Jack 28 Beijing 22000
2 Lida 32 Shanghai 19000
3 John 43 Shenzhen 12000
4 Helen 38 Hengshui 3500
------------------------------
2 Lida 32 Shanghai 19000
0 3 John 43 Shenzhen 12000
1 4 Helen 38 Hengshui 3500
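Two more of the parameters listed above, usecols and nrows, are worth a quick demonstration. In this sketch, StringIO stands in for person.csv on disk so the snippet is self-contained; the empty Salary field shows how a missing value becomes NaN:

```python
import pandas as pd
from io import StringIO

# Inline stand-in for the person.csv file (note John's empty Salary)
csv_text = """ID,Name,Age,City,Salary
1,Jack,28,Beijing,22000
2,Lida,32,Shanghai,19000
3,John,43,Shenzhen,
4,Helen,38,Hengshui,3500
"""
df = pd.read_csv(StringIO(csv_text),
                 usecols=['Name', 'Salary'],  # read only these two columns
                 nrows=3)                     # read only the first 3 records
print(df)
print(df['Salary'].isna().sum())  # 1 -- John's empty field became NaN
```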
5.5. Write CSV file——to_csv
① Commonly used parameters
df.to_csv(path_or_buf,        # Path of the CSV file to write, or a file-like object
          sep=",",            # Field delimiter
          na_rep="",          # String to use for missing values
          columns=None,       # Subset of columns to write, as a sequence of column names (also controls column order)
          header=True,        # Whether to write the column names
          index=True,         # Whether to write the row index
          mode="w",           # File open mode
          encoding="utf-8",   # Encoding
          lineterminator="\n",  # Line terminator (named line_terminator before pandas 2.0)
          date_format=None,   # Date format
          chunksize=None,     # Number of rows to write at a time
          )
② Example
import pandas as pd
data = {'Name': ['Smith', 'Parker'],
        'ID': [101, 102],
        'Language': ['Python', 'JavaScript']}
df = pd.DataFrame(data)
print(df)
print('-' * 30)
# Convert to CSV text; with no path given, to_csv returns a string
csv_data = df.to_csv(sep='|')
print(csv_data)
# Output
Name ID Language
0 Smith 101 Python
1 Parker 102 JavaScript
------------------------------
|Name|ID|Language
0|Smith|101|Python
1|Parker|102|JavaScript
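A sketch exercising a few more of the parameters listed above: columns controls which columns are written and in what order, index=False drops the row labels, and na_rep substitutes a placeholder for missing values.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Smith', 'Parker'],
                   'ID': [101, 102],
                   'Score': [95.5, np.nan]})  # one missing value
csv_data = df.to_csv(columns=['ID', 'Name', 'Score'],  # write in this order
                     index=False,                      # no index column
                     na_rep='N/A')                     # placeholder for NaN
print(csv_data)
# Output:
# ID,Name,Score
# 101,Smith,95.5
# 102,Parker,N/A
```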
5.6. Write Excel file——to_excel
Install dependencies: pip install openpyxl
① Commonly used parameters
df.to_excel(excel_writer, # to be written Excel The path to the file or a ExcelWriter object
sheet_name='Sheet1', # The name of the worksheet to write to
na_rep='', # String to replace missing values
float_format=None, # Specify the format of floating point numbers
columns=None, # Column to write to
header=True, # Specifies whether to write column names
index=True, # Specifies whether to write to the index
index_label=None, # Specify the name of the index column
startrow=0, # Starting row of data writing
startcol=0, # Starting column for data writing
engine=None, # Engine used
merge_cells=True, # Whether to merge cells
encoding=None, # File encoding (deprecated; removed in pandas 2.0)
inf_rep='inf', # Replace strings with positive and negative infinity values
verbose=True, # Whether to display detailed information during the writing process
freeze_panes=None) # Specify panes to freeze
② Example
import pandas as pd
from openpyxl.styles import Font, Alignment
from openpyxl.utils import get_column_letter
font = Font(name='Microsoft YaHei', size=11) # Font style
alignment = Alignment(horizontal='center', vertical='center') # Font layout
# Create the DataFrame data
df = pd.DataFrame({'name': ['Programming help', 'cLanguage Chinese Network', 'microgakuen', '92python'],
                   'rank': [1, 2, 3, 4],
                   'language': ['PHP', 'C', 'PHP', 'Python'],
                   'url': ['www.bianchneg.com', 'c.bianchneg.net', 'www.weixueyuan.com', 'www.92python.com']})
# Create an ExcelWriter object
writer = pd.ExcelWriter('./website.xlsx')
# Write the DataFrame into the writer
df.to_excel(writer)
# Modify the styles in the xlsx
book = writer.book  # The workbook
sheet = book.active  # The active worksheet
for row in sheet.iter_rows(min_row=1, min_col=1, max_row=sheet.max_row, max_col=sheet.max_column):
    for cell in row:
        cell.font = font  # Set the font to 11 pt Microsoft YaHei
        cell.alignment = alignment  # Center the text
for i, col in enumerate(df.columns):
    max_length = max(df[col].astype(str).map(len).max(), len(col))  # Longest value in each column (or its header)
    sheet.column_dimensions[get_column_letter(i + 2)].width = max_length + 8  # Set the column width (i + 2 because column A holds the index)
# Save (in pandas < 2.0 this was writer.save(), which has since been removed)
writer.close()
5.7. Read Excel files – read_excel
① Commonly used parameters
pd.read_excel(io,                  # Excel file path
              sheet_name=0,        # Worksheet name or position
              header=0,            # Row to use as the column names; header=None means no header row, header=[0, 1] uses the first two rows as a MultiIndex
              names=None,          # Custom column names
              index_col=None,      # Column to use as the row index
              usecols=None,        # Columns to read; None (the default) reads all columns
              squeeze=False,       # Whether to return a Series instead of a single-column DataFrame (deprecated in newer pandas)
              dtype=None,          # Data types for the columns
              engine=None,         # Parsing engine to use
              converters=None,     # Conversion functions for specified columns
              true_values=None,    # Strings to treat as True
              false_values=None,   # Strings to treat as False
              skiprows=None,       # Number of lines to skip at the start
              nrows=None,          # Number of rows to read
              na_values=None,      # Strings to treat as missing values
              parse_dates=False,   # Whether to parse date columns into datetime objects
              date_parser=None,    # Function used to parse dates
              thousands=None,      # Thousands separator character
              comment=None,        # Comment character
              skipfooter=0,        # Number of lines to ignore at the end
              convert_float=True)  # Whether integer columns are converted to float (deprecated in newer pandas)
② Example
import pandas as pd
from pandas import DataFrame
# Read all data from the Excel file
df1: DataFrame = pd.read_excel('./website.xlsx',  # Path
                               engine='openpyxl',  # Parsing engine
                               index_col="name",  # Use the 'name' column as the index
                               )
# The index column written to Excel has no header, so it is read back
# as 'Unnamed: 0'; rename it to something readable
df1.columns = df1.columns.str.replace('Unnamed.*', 'col_label', regex=True)
print(df1)
print('-' * 30)
# Read part of the Excel file + a two-level index
df2: DataFrame = pd.read_excel('./website.xlsx',  # Path
                               engine='openpyxl',  # Parsing engine
                               index_col=[1, 2],  # Use two columns as the index (0-based, counted within the selected columns)
                               usecols=[1, 2, 3]  # Read three columns from the file (0-based)
                               )
# Rename the empty 'Unnamed: ...' column name, as above
df2.columns = df2.columns.str.replace('Unnamed.*', 'col_label', regex=True)
print(df2)
# Output
col_label rank language url
name
Programming help 0 1 PHP www.bianchneg.com
cLanguage Chinese Network 1 2 C c.bianchneg.net
microgakuen 2 3 PHP www.weixueyuan.com
92python 3 4 Python www.92python.com
------------------------------
name
rank language
1 PHP Programming help
2 C cLanguage Chinese Network
3 PHP microgakuen
4 Python 92python
6. Data table interconnection, grouping aggregation, and data pivoting
6.1. Data table interconnection
Method | Description |
---|---|
merge | Used to join rows in different DataFrames based on one or more keys (column values) |
concat | Used to concatenate multiple DataFrames along a specified axis |
join | Used to join two DataFrames based on their index (row label) |
append | Used to add a DataFrame to the end of another DataFrame (removed in pandas 2.0; use pd.concat instead) |
import pandas as pd
# merge: join tables based on a key column
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})
merged_df = pd.merge(df1, df2, on='key', how='outer')  # 'outer' takes the union of the keys
print(merged_df)
print('-' * 30)
# concat: concatenate along a specified axis (vertically or horizontally)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7']})
concatenated_df1 = pd.concat([df1, df2], axis=0)
print(concatenated_df1)
print('-' * 30)
concatenated_df2 = pd.concat([df1, df2], axis=1)
print(concatenated_df2)
print('-' * 30)
# join: join based on the index (row labels)
df1 = pd.DataFrame({'value1': [1, 2, 3, 4]}, index=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame({'value2': [5, 6, 7, 8]}, index=['B', 'D', 'E', 'F'])
joined_df = df1.join(df2, how='outer')
print(joined_df)
print('-' * 30)
# append: vertical concatenation (deprecated; removed in pandas 2.0, use pd.concat instead)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7']})
appended_df = df1.append(df2)
print(appended_df)
# Output
key value_x value_y
0 A 1.0 NaN
1 B 2.0 5.0
2 C 3.0 NaN
3 D 4.0 6.0
4 E NaN 7.0
5 F NaN 8.0
------------------------------
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
0 A4 B4
1 A5 B5
2 A6 B6
3 A7 B7
------------------------------
A B A B
0 A0 B0 A4 B4
1 A1 B1 A5 B5
2 A2 B2 A6 B6
3 A3 B3 A7 B7
------------------------------
value1 value2
A 1.0 NaN
B 2.0 5.0
C 3.0 NaN
D 4.0 6.0
E NaN 7.0
F NaN 8.0
------------------------------
A B
0 A0 B0
1 A1 B1
2 A2 B2
3 A3 B3
0 A4 B4
1 A5 B5
2 A6 B6
3 A7 B7
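how='outer' above takes the union of the keys; the other how= values are worth contrasting. A short sketch with the same two tables, also showing indicator=True, which records where each row came from:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'], 'value': [1, 2, 3, 4]})
df2 = pd.DataFrame({'key': ['B', 'D', 'E', 'F'], 'value': [5, 6, 7, 8]})

# inner: keep only keys present in both tables (here B and D)
inner = pd.merge(df1, df2, on='key', how='inner')
print(inner)

# left: keep every key from df1; indicator adds a _merge column
# marking each row as 'both' or 'left_only'
left = pd.merge(df1, df2, on='key', how='left', indicator=True)
print(left)
```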
6.2. Group aggregation
Method | Description |
---|---|
groupby | used to group data |
agg | Used to aggregate grouped data |
import pandas as pd
# Create an example DataFrame
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'one', 'two', 'two', 'one'],
                   'C': [1, 2, 3, 4, 5]})
# Group by column 'A'
grouped = df.groupby('A')
# Display grouped data
for name, group in grouped:
    print(name)
    print(group)
    print('-' * 30)
# Aggregate the grouped data: compute the mean and sum of C for each group
aggregated = grouped.agg({'C': ['mean', 'sum']})
print(aggregated)
# Output
bar
A B C
1 bar one 2
3 bar two 4
------------------------------
foo
A B C
0 foo one 1
2 foo two 3
4 foo one 5
------------------------------
C
mean sum
A
bar 3 6
foo 3 9
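agg is not limited to a single column. Named aggregation (available since pandas 0.25) aggregates several columns at once and gives the result columns flat, readable names; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'B': ['one', 'one', 'two', 'two', 'one'],
                   'C': [1, 2, 3, 4, 5]})
# Each keyword names an output column: (source column, aggregation)
result = df.groupby('A').agg(c_mean=('C', 'mean'),
                             c_sum=('C', 'sum'),
                             n_rows=('B', 'count'))
print(result)
# bar: mean 3.0, sum 6, 2 rows; foo: mean 3.0, sum 9, 3 rows
```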
6.3. Data pivoting
Pivoting: a data analysis technique for summarizing and analyzing large amounts of data in order to better understand the relationships and trends within it. Pivoting condenses complex data structures into easy-to-understand forms, giving an intuitive overview of the data and deeper insight into it.
Pivot table: one concrete implementation of pivoting. By summarizing, grouping and aggregating data, it presents the data in a multi-dimensional, multi-level way, revealing the internal relationships and patterns in the data.
df.pivot_table(
    index=None,  # Column(s) to group by; becomes the row index of the result
    columns=None,  # Column(s) to group by; becomes the column index of the result
    values=None,  # Column(s) to aggregate, i.e. the data shown in the result
    aggfunc="mean"  # Aggregation function, e.g. sum, mean, count, etc.; the default is numpy.mean
)
import pandas as pd
# Create an example DataFrame
df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                   'C': [1, 2, 3, 4, 5, 6],
                   'D': [7, 8, 9, 10, 11, 12]})
print(df)
print('-' * 30)
# Use pivot_table to create a pivot table
pivot = df.pivot_table(index='A',  # Group by A; A becomes the pivot table's row index
                       columns='B',  # Group by B; B becomes the pivot table's columns
                       values='C',  # Aggregate column C
                       aggfunc='sum')  # Sum C within each group, e.g. the (bar, one) cell holds the sum of C
print(pivot)
# Output
A B C D
0 foo one 1 7
1 foo one 2 8
2 foo two 3 9
3 bar two 4 10
4 bar one 5 11
5 bar one 6 12
------------------------------
B one two
A
bar 11 4
foo 3 3
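Two more pivot_table options often come in handy: margins=True appends an "All" row and column holding the grand totals, and aggfunc also accepts a list of functions. A sketch of the margins option on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                   'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                   'C': [1, 2, 3, 4, 5, 6]})
# margins=True adds the row/column totals under the label 'All'
pivot = df.pivot_table(index='A', columns='B', values='C',
                       aggfunc='sum', margins=True)
print(pivot)
# The 'All' cell for row 'bar' is 15, for row 'foo' it is 6,
# and the grand total at ('All', 'All') is 21
```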