// 01 — data dictionary
Feature Dictionary
Note: dropoff_datetime and store_and_fwd_flag are not available when a trip starts, so both are excluded from the model features.
- id: Unique identifier for each trip
- vendor_id: Code indicating the provider associated with the trip
- pickup_datetime: Date and time when the meter was engaged
- dropoff_datetime: Date and time when the meter was disengaged
- passenger_count: Number of passengers (entered by the driver)
- pickup_longitude: Longitude where the meter was engaged
- pickup_latitude: Latitude where the meter was engaged
- dropoff_longitude: Longitude where the meter was disengaged
- dropoff_latitude: Latitude where the meter was disengaged
- store_and_fwd_flag: Whether the trip was stored in memory before sending (Y/N)
- trip_duration (★ target): Duration of the trip in seconds
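The exclusion described in the note above can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing code: the helper name split_features_target and the sample row (including its id and coordinate values) are hypothetical, and only the column names come from the dictionary.

```python
import csv
import io

TARGET = "trip_duration"
# Columns the note above excludes because they are unknown at trip start.
EXCLUDED = {"dropoff_datetime", "store_and_fwd_flag"}

# Hypothetical one-row sample; the values are made up for illustration.
SAMPLE_CSV = """id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
trip_a,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.98,40.76,-73.96,40.77,N,455
"""


def split_features_target(row):
    """Return (features, target) for one row, dropping excluded columns."""
    features = {
        name: value
        for name, value in row.items()
        if name not in EXCLUDED and name != TARGET
    }
    return features, int(row[TARGET])


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features, target = split_features_target(rows[0])
print(sorted(features))  # remaining candidate feature columns
print(target)            # trip duration in seconds
```

Whether id should also be dropped before modelling is left open here; the sketch removes only the two columns the note names, plus the target.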
// 02 — key findings
Conclusions
- Trip durations follow a near log-normal distribution, with a peak around exp(6.5) ≈ 665 seconds, roughly 11 minutes.
- Several suspiciously short rides last less than 10 seconds.
- A few extreme outliers sit near a log-duration of 12 (about 45 hours).
- Most trips involve only 1 passenger; trips with 7–9 passengers are very rare.
- Vendor 2 has more trips than Vendor 1.
- Weekend pickup counts are much lower than weekday counts; volume peaks on Thursday.
- Pickup volume is highest in the late evening and lower during the morning peak hours.
- Most trips are concentrated in a few significant geographic clusters, visible in the latitude/longitude histograms.
- Trip durations are shorter late at night and early in the morning, consistent with lower traffic density.
- Hourly pickup count and hourly trip duration are positively correlated.
- Median trip duration does not vary significantly by vendor.
- Passenger counts of 1, 2, and 3 have similar duration distributions. Higher counts have fewer outliers.
- In the correlation heatmap, the latitude and longitude features correlate more strongly with trip duration than the other features.
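The log-scale values quoted in the findings can be converted back to clock time, and the two kinds of suspicious durations can be flagged, with a short sketch like the one below. The 10-second lower bound comes from the findings; the upper cutoff MAX_LOG_DURATION, the function names, and the sample durations are assumptions made for illustration and would need tuning against the real data.

```python
import math

MIN_SECONDS = 10         # from the findings: rides under 10 s look suspicious
MAX_LOG_DURATION = 11.0  # assumed cutoff (about 16.6 hours), not from the source


def log_to_minutes(log_duration: float) -> float:
    """Convert a natural-log duration (log of seconds) to minutes."""
    return math.exp(log_duration) / 60.0


def flag_duration(seconds: float) -> str:
    """Label a trip duration as 'too_short', 'too_long', or 'ok'."""
    if seconds < MIN_SECONDS:
        return "too_short"
    if math.log(seconds) >= MAX_LOG_DURATION:
        return "too_long"
    return "ok"


# Hypothetical durations in seconds, for illustration only.
sample_durations = [4, 665, 1020, 162755]
labels = [flag_duration(s) for s in sample_durations]

print(round(log_to_minutes(6.5), 1))  # distribution peak, in minutes
print(round(math.exp(12) / 3600, 1))  # outliers near log-duration 12, in hours
print(labels)
```

Flagging rather than deleting keeps the decision about how to treat these rows (drop, cap, or keep) separate from the detection step.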