Search
Aggregation
import pandas as pd

Named Aggregation

Getting multiple statistics per group operation, new in pandas 0.25

bikes = pd.read_csv('../data/indego-trips.csv')
(
    bikes
    .groupby('bike_id')
    .agg(
        min_duration=pd.NamedAgg(column='duration', aggfunc='min'),
        max_duration=pd.NamedAgg(column='duration', aggfunc='max'),
        med_duration=pd.NamedAgg(column='duration', aggfunc='median'),
        avg_duration=pd.NamedAgg(column='duration', aggfunc='mean'),
        total_trips=pd.NamedAgg(column='trip_id', aggfunc='count')
    )
    .reset_index()
    .head()
)
bike_id min_duration max_duration med_duration avg_duration total_trips
0 42 60 60 60.0 60.000000 1
1 44 33 844 76.0 317.666667 3
2 47 3 12 7.5 7.500000 2
3 102 3 55 10.0 12.921053 38
4 170 1 78 10.0 15.276316 76

Usual Aggregation

We can access groups from a groupby object

(
    bikes
    .groupby('start_station')
    .get_group(3020)
    .head()
)
trip_id duration start_time end_time start_station start_lat start_lon end_station end_lat end_lon bike_id plan_duration trip_route_category passholder_type bike_type
0 326702093 5 2019-10-01 00:07:03 2019-10-01 00:12:09 3020 39.949219 -75.190361 3163 39.949741 -75.180969 19707 365 One Way Indego365 electric
34 326702057 1 2019-10-01 01:29:38 2019-10-01 01:30:09 3020 39.949219 -75.190361 3020 39.949219 -75.190361 2647 30 Round Trip Indego30 standard
126 326701961 6 2019-10-01 06:05:17 2019-10-01 06:11:41 3020 39.949219 -75.190361 3170 39.944260 -75.181343 11788 30 One Way Indego30 standard
145 326701941 5 2019-10-01 06:23:20 2019-10-01 06:28:19 3020 39.949219 -75.190361 3006 39.952202 -75.203110 14513 30 One Way Indego30 standard
264 326701822 19 2019-10-01 07:21:12 2019-10-01 07:40:15 3020 39.949219 -75.190361 3158 39.925522 -75.169037 5345 30 One Way Indego30 standard

Another way to do multiple aggregation. Actually, I think it's just shorthand NamedAgg

(
    bikes
    .groupby('bike_id', as_index=False)
    .agg(
        min_duration=('duration', 'min'),
        max_duration=('duration', 'max'),
        med_duration=('duration', 'median'),
        avg_duration=('duration', 'mean'),
        total_trips=('trip_id', 'count')
    )
    .head()
)
min_duration max_duration med_duration avg_duration total_trips
0 60 60 60.0 60.000000 1
1 33 844 76.0 317.666667 3
2 3 12 7.5 7.500000 2
3 3 55 10.0 12.921053 38
4 1 78 10.0 15.276316 76