// 01 — data dictionary

Feature Dictionary

Note: dropoff_datetime and store_and_fwd_flag are not known before the trip starts and are therefore excluded from the model features.

id
Unique identifier for each trip
vendor_id
Code indicating the provider associated with the trip
pickup_datetime
Date and time when the meter was engaged
dropoff_datetime
Date and time when the meter was disengaged
passenger_count
Number of passengers (driver entered)
pickup_longitude
Longitude where the meter was engaged
pickup_latitude
Latitude where the meter was engaged
dropoff_longitude
Longitude where the meter was disengaged
dropoff_latitude
Latitude where the meter was disengaged
store_and_fwd_flag
Whether the trip record was held in vehicle memory before being sent to the vendor (Y/N)
trip_duration ★ target
Duration of the trip in seconds
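The exclusion rule in the note above can be sketched as a small column filter. This is only an illustration: the column names come from the dictionary, but the helper `select_feature_columns` is hypothetical, and dropping `id` is an added assumption (it is a row identifier) that the note itself does not state.

```python
# Columns listed in the feature dictionary.
ALL_COLUMNS = [
    "id", "vendor_id", "pickup_datetime", "dropoff_datetime",
    "passenger_count", "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude", "store_and_fwd_flag",
    "trip_duration",
]

TARGET = "trip_duration"

# Excluded per the note: not known before the trip starts.
EXCLUDED = {"dropoff_datetime", "store_and_fwd_flag"}

# Assumption (not from the source): `id` is only a row identifier,
# so it is dropped alongside the target.
NON_FEATURES = {"id", TARGET}


def select_feature_columns(columns):
    """Return the columns usable as model inputs, preserving order."""
    return [c for c in columns if c not in EXCLUDED | NON_FEATURES]


FEATURES = select_feature_columns(ALL_COLUMNS)
print(FEATURES)
```

Keeping the excluded names in one set makes the leakage rule easy to audit if more post-trip columns are added later.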
// 02 — key findings

Conclusions

  • Most trip durations follow a near log-normal distribution, peaking around exp(6.5) ≈ 665 seconds, roughly 11 minutes.
  • Several suspiciously short rides last less than 10 seconds.
  • A few extreme outliers sit near a log-duration of 12 (about 45 hours).
  • Most trips involve only 1 passenger; trips with 7–9 passengers are very rare.
  • Vendor 2 has more trips than Vendor 1.
  • Weekend pickups are much lower than weekday pickups; volume peaks on Thursday.
  • Pickup volume is highest in the late evening and lower during the morning peak hours.
  • Most trips are concentrated in a few significant geographic clusters, visible in lat/long histograms.
  • Trip durations are shorter late at night and in the early morning, consistent with lower traffic density.
  • By hour of day, pickup count and trip duration are positively correlated.
  • Median trip duration does not vary significantly by vendor.
  • Passenger counts of 1, 2, and 3 have similar duration distributions; higher counts show fewer outliers.
  • In the correlation heatmap, the latitude and longitude features correlate more strongly with trip duration than the other features do.
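Several of the findings above rest on simple arithmetic over trip_duration, which can be sketched with the standard library alone. This is a rough illustration, not the notebook's code. The 10-second cutoff comes from the findings, but the log-duration cutoff of 12, the timestamp format, the helper names, and the sample rows are all assumptions made up for the example.

```python
import math
from collections import defaultdict
from datetime import datetime
from statistics import median

SHORT_CUTOFF_S = 10      # from the findings: rides under 10 seconds look suspicious
LONG_LOG_CUTOFF = 12.0   # assumed threshold for an extreme outlier


def log_to_minutes(log_duration):
    """Convert a natural-log duration (log of seconds) to minutes."""
    return math.exp(log_duration) / 60.0


def flag_duration(seconds):
    """Label a duration as 'short', 'long', or 'ok'."""
    if seconds < SHORT_CUTOFF_S:
        return "short"
    if math.log(seconds) >= LONG_LOG_CUTOFF:
        return "long"
    return "ok"


def median_duration_by_hour(trips):
    """Median trip_duration per pickup hour.

    `trips` is an iterable of (pickup_datetime, trip_duration) pairs, with the
    timestamp assumed to look like '2016-03-14 18:05:00'.
    """
    by_hour = defaultdict(list)
    for pickup, seconds in trips:
        hour = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S").hour
        by_hour[hour].append(seconds)
    return {hour: median(vals) for hour, vals in sorted(by_hour.items())}


# Hypothetical rows, invented for illustration only.
sample = [
    ("2016-03-14 03:10:00", 300),
    ("2016-03-14 03:40:00", 420),
    ("2016-03-14 18:05:00", 900),
    ("2016-03-14 18:30:00", 5),       # flagged as 'short' and dropped
    ("2016-03-14 18:45:00", 1100),
]

clean = [(p, s) for p, s in sample if flag_duration(s) == "ok"]
hourly = median_duration_by_hour(clean)

print(round(log_to_minutes(6.5), 1))   # peak of the log-duration histogram, in minutes
print(hourly)
```

The median is used instead of the mean so that any remaining long trips have less influence on the hourly comparison.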