New York City Taxi Trip Duration
To improve the efficiency of taxi dispatching systems, it is important to be able to predict how long a driver will have their taxi occupied. If a dispatcher knew approximately when a driver would finish their current ride, they could better decide which driver to assign to each pickup request. This project builds a model that predicts the total ride duration of taxi trips in New York City.
- Dictionary
Let's check the data files! According to the data description we should find the following columns:
- id: a unique identifier for each trip.
- vendor_id: a code indicating the provider associated with the trip record.
- pickup_datetime: date and time when the meter was engaged.
- dropoff_datetime: date and time when the meter was disengaged.
- passenger_count: the number of passengers in the vehicle (driver-entered value).
- pickup_longitude: the longitude where the meter was engaged
- pickup_latitude: the latitude where the meter was engaged
- dropoff_longitude: the longitude where the meter was disengaged
- dropoff_latitude: the latitude where the meter was disengaged
- store_and_fwd_flag: this flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server (Y=store and forward; N=not a store and forward trip)
- trip_duration: (target) duration of the trip in seconds
Two variables, dropoff_datetime and store_and_fwd_flag, are not available before the trip starts and hence will not be used as features in the model.
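The column selection described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two sample rows are made up, and dropping `id` from the features is an extra assumption (an identifier is unlikely to carry predictive signal). Only the column names come from the data dictionary.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the data dictionary.
trips = pd.DataFrame({
    "id": ["id0000001", "id0000002"],
    "vendor_id": [1, 2],
    "pickup_datetime": pd.to_datetime(["2016-03-14 17:24:55", "2016-06-12 00:43:35"]),
    "dropoff_datetime": pd.to_datetime(["2016-03-14 17:32:30", "2016-06-12 00:54:38"]),
    "passenger_count": [1, 1],
    "pickup_longitude": [-73.982, -73.980],
    "pickup_latitude": [40.768, 40.739],
    "dropoff_longitude": [-73.965, -73.999],
    "dropoff_latitude": [40.766, 40.731],
    "store_and_fwd_flag": ["N", "N"],
    "trip_duration": [455, 663],
})

# Columns the text excludes because they are not known when the trip starts.
UNAVAILABLE_AT_PICKUP = ["dropoff_datetime", "store_and_fwd_flag"]
TARGET = "trip_duration"

# Assumption: the identifier is also dropped, since it is not a meaningful feature.
X = trips.drop(columns=UNAVAILABLE_AT_PICKUP + ["id", TARGET])
y = trips[TARGET]

feature_columns = list(X.columns)
print(feature_columns)
```

Keeping the excluded columns in a named list makes it easy to apply the same selection to any new data before prediction.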
- Data Analysis
Here you can find the Google Colab notebook: Data Analysis of NYC taxi trip duration.
- Conclusions:
- The majority of rides follow a rather smooth distribution that looks almost log-normal, with a peak around exp(6.5) ≈ 665 seconds, i.e. about 11 minutes.
- There are several suspiciously short rides lasting less than 10 seconds.
- There are also a few huge outliers near 12 on the log scale (exp(12) is roughly 163,000 seconds, or about 45 hours).
- Most trips involve only 1 passenger. There are trips with 7-9 passengers, but they are very few.
- Vendor 2 has more trips than vendor 1.
- The number of pickups on weekends is much lower than on weekdays, with a peak on Thursday (4). Note that here the weekday is a decimal number, where 0 is Sunday and 6 is Saturday.
- The number of pickups is, as expected, highest in the late evening. However, it is much lower during the morning peak hours.
- Most trips are concentrated within a limited range of latitudes and longitudes, with a few significant clusters. These clusters appear as the numerous peaks in the latitude and longitude histograms.
- Trip durations are clearly shorter in the late night and early morning hours, which can be attributed to low traffic density.
- Trip duration by hour follows a pattern similar to the number of pickups, indicating a correlation between the number of pickups and trip duration.
- Median trip duration does not vary much between vendors.
- The boxplot shows that there is not much difference in the distribution for the most frequently occurring passenger count values: 1, 2 and 3.
- Another key observation is that the number of outliers is lower for higher passenger counts, but this simply reflects how rarely those passenger counts occur.
- From the correlation heatmap we see that the latitude and longitude features have a higher correlation with the target than the other features.
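Several of the quantities in these conclusions (log-scale duration, weekday number, pickup hour, very short rides) can be derived in a few lines of pandas. The sketch below is illustrative only: the sample rows are made up, the derived column names (`log_duration`, `weekday`, `hour`, `too_short`) are hypothetical, and it assumes the log scale is the natural logarithm of the duration in seconds. Note that pandas numbers weekdays from Monday = 0, so a shift is needed to match the Sunday = 0 convention used above.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; only the column names come from the data dictionary.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime([
        "2016-01-03 08:15:00",  # a Sunday
        "2016-01-07 22:40:00",  # a Thursday
        "2016-01-09 02:05:00",  # a Saturday
    ]),
    "trip_duration": [665, 5, 162755],  # seconds
})

# Natural log of the duration in seconds (assumed to be the scale used above).
trips["log_duration"] = np.log(trips["trip_duration"])

# Weekday with 0 = Sunday ... 6 = Saturday.
# pandas' dt.weekday uses 0 = Monday, so shift by one and wrap around.
trips["weekday"] = (trips["pickup_datetime"].dt.weekday + 1) % 7
trips["hour"] = trips["pickup_datetime"].dt.hour

# Flag suspiciously short rides; the 10-second cutoff is taken from the text.
trips["too_short"] = trips["trip_duration"] < 10

# Convert the log-scale values quoted in the conclusions back to time units.
peak_minutes = np.exp(6.5) / 60      # location of the distribution peak, in minutes
outlier_hours = np.exp(12) / 3600    # location of the huge outliers, in hours

print(trips)
print(round(peak_minutes, 1), round(outlier_hours, 1))
```

Grouping by `weekday` or `hour` and taking counts or medians of `trip_duration` would then reproduce the kind of per-day and per-hour summaries discussed above.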