NYC Taxi Trip Duration | Alejandro Pinto

// 01 — data dictionary

Feature Dictionary

Note: dropoff_datetime and store_and_fwd_flag are not available before the trip starts and are therefore excluded from the model features.

id

Unique identifier for each trip

vendor_id

Code indicating the provider associated with the trip

pickup_datetime

Date and time when the meter was engaged

dropoff_datetime

Date and time when the meter was disengaged

passenger_count

Number of passengers (driver entered)

pickup_longitude

Longitude where the meter was engaged

pickup_latitude

Latitude where the meter was engaged

dropoff_longitude

Longitude where the meter was disengaged

dropoff_latitude

Latitude where the meter was disengaged

store_and_fwd_flag

Whether the trip record was held in vehicle memory before being sent to the vendor (Y/N)

trip_duration ★ target

Duration of the trip in seconds
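
The exclusion described in the note at the top of this dictionary can be sketched in a few lines. This is a minimal illustration, assuming each trip is a plain dictionary keyed by the column names listed here; `LEAKY_COLUMNS`, `TARGET`, and `split_features_target` are hypothetical names, `id` is an assumed key for the trip identifier, and the sample record is made up rather than taken from the dataset.

```python
# Columns the note marks as unavailable before the trip starts.
LEAKY_COLUMNS = {"dropoff_datetime", "store_and_fwd_flag"}
TARGET = "trip_duration"


def split_features_target(row):
    """Return (features, target) for one trip record, dropping leaky columns."""
    features = {
        name: value
        for name, value in row.items()
        if name not in LEAKY_COLUMNS and name != TARGET
    }
    return features, row[TARGET]


# Made-up example record (illustrative values, not taken from the dataset).
example_row = {
    "id": "trip_0001",
    "vendor_id": 2,
    "pickup_datetime": "2016-03-14 17:24:55",
    "dropoff_datetime": "2016-03-14 17:32:30",
    "passenger_count": 1,
    "pickup_longitude": -73.98,
    "pickup_latitude": 40.76,
    "dropoff_longitude": -73.96,
    "dropoff_latitude": 40.77,
    "store_and_fwd_flag": "N",
    "trip_duration": 455,
}

features, target = split_features_target(example_row)
print(sorted(features))  # remaining feature names
print(target)            # target value in seconds
```

Whether the identifier column should also be dropped before modelling is a separate choice that this sketch leaves open.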

// 02 — key findings

Most rides follow a near log-normal duration distribution with a peak around exp(6.5) ≈ 665 seconds, roughly 11 minutes.
Several suspiciously short rides last less than 10 seconds.
A few huge outliers sit near a log-duration of 12, i.e. exp(12) ≈ 162,755 seconds, or about 45 hours.
Most trips involve only 1 passenger; trips with 7–9 passengers are very rare.
Vendor 2 has more trips than Vendor 1.
Pickup counts are much lower on weekends than on weekdays, with a peak on Thursday.
Pickup volume is highest in the late evening and lower during the morning peak hours.
Most trips are concentrated in a few significant geographic clusters, visible in lat/long histograms.
Trip durations are shorter late at night and in the early morning, which is consistent with lower traffic density.
By hour of day, pickup count and trip duration are positively correlated.
Median trip duration does not vary significantly by vendor.
Passenger counts of 1, 2, and 3 have similar duration distributions; higher counts show fewer outliers.
In the correlation heatmap, the latitude and longitude features correlate more strongly with trip duration than the other features do.
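
The log-scale figures in the findings above convert back to seconds with the exponential. The sketch below does that conversion and flags the two kinds of anomalies mentioned. It assumes the natural logarithm of the duration in seconds was plotted; the 10-second and log-duration 12 cutoffs and the helper names `log_to_minutes` and `flag_trip` are illustrative choices, not necessarily the filtering rules used in the analysis.

```python
import math

SHORT_TRIP_SECONDS = 10      # "suspiciously short" cutoff (assumed from the findings)
LONG_TRIP_LOG_SECONDS = 12   # approximate log-duration of the huge outliers (assumed)


def log_to_minutes(log_seconds):
    """Convert a natural-log duration (log of seconds) to minutes."""
    return math.exp(log_seconds) / 60


def flag_trip(duration_seconds):
    """Label a trip as 'short', 'long', or 'ok' using the cutoffs above."""
    if duration_seconds < SHORT_TRIP_SECONDS:
        return "short"
    if math.log(duration_seconds) >= LONG_TRIP_LOG_SECONDS:
        return "long"
    return "ok"


print(round(log_to_minutes(6.5), 1))      # distribution peak, in minutes
print(round(log_to_minutes(12) / 60, 1))  # outlier scale, in hours

# Made-up durations in seconds, for illustration only.
for seconds in (3, 665, 200_000):
    print(seconds, flag_trip(seconds))
```

Whether flagged trips should be dropped, capped, or kept is a modelling decision; the sketch only labels them.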