New York City Taxi Drive Duration

To improve the efficiency of taxi dispatching systems, it is important to be able to predict how long a driver will have their taxi occupied. If a dispatcher knew approximately when a driver would end their current ride, they could better identify which driver to assign to each pickup request. This project builds a model that predicts the total ride duration of taxi trips in New York City.

Dictionary

Let's check the data files! According to the data description we should find the following columns:

id: a unique identifier for each trip.
vendor_id: a code indicating the provider associated with the trip record
pickup_datetime: date and time when the meter was engaged
dropoff_datetime: date and time when the meter was disengaged
passenger_count: the number of passengers in the vehicle (driver-entered value)
pickup_longitude: the longitude where the meter was engaged
pickup_latitude: the latitude where the meter was engaged
dropoff_longitude: the longitude where the meter was disengaged
dropoff_latitude: the latitude where the meter was disengaged
store_and_fwd_flag: this flag indicates whether the trip record was held in vehicle memory before sending to the vendor because the vehicle did not have a connection to the server (Y=store and forward; N=not a store and forward trip)
trip_duration: (target) duration of the trip in seconds

Two of these variables, dropoff_datetime and store_and_fwd_flag, are not available before the trip starts and hence will not be used as features for the model.
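A minimal sketch of that feature/target split, assuming the records come from a CSV with the columns listed in the dictionary. The sample row is made up for illustration, and `EXCLUDED_COLUMNS`, `TARGET`, and `split_features_target` are hypothetical names, not part of the project code.

```python
import csv
import io

# Made-up sample in the documented column layout (illustration only).
SAMPLE_CSV = """id,vendor_id,pickup_datetime,dropoff_datetime,passenger_count,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,trip_duration
trip_a,2,2016-03-14 17:24:55,2016-03-14 17:32:30,1,-73.982,40.768,-73.965,40.766,N,455
"""

TARGET = "trip_duration"
# The two columns the text excludes from the feature set.
EXCLUDED_COLUMNS = {"dropoff_datetime", "store_and_fwd_flag"}


def split_features_target(row):
    """Return (features, target) for one CSV row given as a dict."""
    features = {
        name: value
        for name, value in row.items()
        if name != TARGET and name not in EXCLUDED_COLUMNS
    }
    # id stays in the dict as a row key; the text does not say whether to drop it.
    return features, int(row[TARGET])


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features, target = split_features_target(rows[0])
print(sorted(features))
print(target)
```

Keeping the excluded names in one set makes the rule explicit and easy to extend if other columns turn out to be unsuitable as features.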

Data Analysis

The analysis is available in this Google Colab notebook: Data Analysis of NYC taxi trip duration.

Conclusions:

The majority of rides follow a rather smooth distribution that looks almost log-normal, with a peak around exp(6.5) seconds, i.e. about 11 minutes.
There are several suspiciously short rides lasting less than 10 seconds.
There are a few huge outliers near 12 on the log scale (exp(12) seconds is roughly 45 hours).
Most of the trips involve only 1 passenger. There are trips with 7-9 passengers, but they are very rare.
Vendor 2 has more trips than vendor 1.
The number of pickups on weekends is much lower than on weekdays, with a peak on Thursday (4). Note that weekday is encoded as a number, where 0 is Sunday and 6 is Saturday.
The number of pickups is, as expected, highest in the late evening. However, it is much lower during the morning peak hours.
Most trips are concentrated within a narrow latitude/longitude range, with a few significant clusters. These clusters appear as the numerous peaks in the latitude and longitude histograms.
Trip durations are clearly shorter in the late night and early morning hours, which can be attributed to low traffic density.
Trip duration by hour follows a pattern similar to the number of pickups, which suggests a correlation between the number of pickups and trip duration.
Median trip duration does not vary much between vendors.
The boxplot shows that there is not much difference in the distribution for the most frequently occurring passenger count values: 1, 2, and 3.
Another key observation is that the number of outliers is lower for higher passenger counts, but this simply reflects how rarely those passenger counts occur.
From the correlation heatmap, we see that the latitude and longitude features have a higher correlation with the target than the other features.
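The quantities behind several of these conclusions can be sketched as follows. This is a minimal illustration on made-up trips, not the notebook's code: the sample values and the helper name `weekday_sunday_zero` are assumptions. It shows the log transform of the target, the weekday encoding with 0 as Sunday, the pickup hour, the median duration per vendor, and the conversion of log-scale values back to minutes and hours.

```python
import math
from datetime import datetime
from statistics import median

# Made-up sample trips: (pickup_datetime, vendor_id, trip_duration in seconds).
trips = [
    ("2016-03-14 17:24:55", 2, 455),   # a Monday
    ("2016-06-12 00:43:35", 1, 663),   # a Sunday
    ("2016-01-21 08:05:10", 2, 2124),  # a Thursday
]


def weekday_sunday_zero(ts):
    """Weekday with 0 = Sunday ... 6 = Saturday, the encoding used in the text."""
    # datetime.weekday() counts from Monday = 0, so shift by one day.
    return (ts.weekday() + 1) % 7


rows = []
durations_by_vendor = {}
for pickup, vendor_id, duration in trips:
    ts = datetime.strptime(pickup, "%Y-%m-%d %H:%M:%S")
    rows.append({
        "vendor_id": vendor_id,
        "weekday": weekday_sunday_zero(ts),
        "hour": ts.hour,
        # Natural log of the duration; the distribution is inspected on this scale.
        "log_duration": math.log(duration),
    })
    durations_by_vendor.setdefault(vendor_id, []).append(duration)

# Median trip duration (seconds) for each vendor.
median_by_vendor = {v: median(d) for v, d in durations_by_vendor.items()}

# Converting log-scale readings back to familiar units.
peak_minutes = math.exp(6.5) / 60     # about 11.1 minutes
outlier_hours = math.exp(12) / 3600   # about 45.2 hours

print([r["weekday"] for r in rows], [r["hour"] for r in rows])
print(median_by_vendor, round(peak_minutes, 1), round(outlier_hours, 1))
```

The explicit shift in `weekday_sunday_zero` matters because Python's default weekday numbering starts at Monday, so Thursday would otherwise be 3 rather than the 4 quoted above.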