Used Cars EDA (Cars: 3)


This is the third part of an ongoing series of posts on a car data analysis project. The project consists of gathering data about used cars, then analyzing it and modelling the price as the target variable.

This part will deal mostly with the so-called EDA (exploratory data analysis). The Jupyter notebook is heavily commented and you can find it on my GitHub, here. As in the previous posts, on my site I just walk you through my analysis and emphasize the key points. Let's go, this is an interesting ride.

The Jupyter notebook is available here.

Automobile!

Problem formulation

The problem at hand isn’t really new and it has been tackled numerous times. My guess is that the need for a real system for estimating the prices of used cars isn’t the primary force behind this quest. Rather, I believe it is the availability and sheer amount of highly structured data that is available with just a little bit of scraping in every part of the world. Also, we are all car experts, aren’t we?

Personally, I really need to part with my old car - a beautiful silver Fiat Stilo with just under 290.000 km on it - and I have been skimming the ads for the last couple of months (ok, years), only to see the average price plummet and further demotivate me from getting anything done, really. There comes a point in a car's life when it really isn't worth selling anymore, you know that, right? Well, I won't get into that any further, but let's talk a bit about the data before we take a peek at it.

The problem I want to solve is the following: given a brand, model, production year, odometer reading, engine type etc., predict as accurately as possible the price of the car.

Project outline

Looking at this chart, we are in the green area now.

It sounds pretty simple and really, it is - I just need a number, a number of euros to be precise. I need a system (mathematical, software, programmatic) that will take these inputs, do some magic, and output a prediction or an estimate (although estimate wouldn't be a wise choice of word in this particular case, because it could get confused with the concept of statistical estimation, and we do not want that).

This particular notebook will deal with the exploration of the scraped data, the so-called EDA - Exploratory Data Analysis. A really fancy name for some, as it turns out, pretty simple mathematical operations.

The initial import is the following:

%matplotlib inline
import numpy as np
import pandas as pd

import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (15.0, 10.0)  # big default figure size

import seaborn as sns
color = sns.color_palette()

from scipy import stats
from scipy.stats import norm, skew  # for some statistics

from pandas_profiling import ProfileReport

color_green = '#46b038'  # accent color used throughout the plots

Pandas profiling

Pandas Profiling is a very interesting and useful package for pandas that allows us to quickly visualize the data - it generates a special report. The report is full of useful information about the columns: the types of variables, the distributions, skewness etc. I imagine it would be a must-have for any Kaggle-style competition or a project on a tight deadline. Tip: try it out.
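Generating the report is essentially a one-liner. A minimal sketch, assuming the scraped data has already been loaded into the raw_data dataframe:

profile = ProfileReport(raw_data, title='Used Cars EDA')  # build the report
profile.to_notebook_iframe()  # render it inline; profile.to_file('report.html') would save it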

The report gives us a nice overview of all the variables and accelerates the process of really getting to know your data. However, there is no avoiding the most fun part: diving in and examining the variables one by one. We'll start with the most important - the price, our target variable.

Price

count      7323.000000
mean       5517.203469
std        8658.753195
min           2.000000
25%        2000.000000
50%        3499.000000
75%        6900.000000
max      499990.000000

While the car advertised for half a million euros does actually exist (spoiler: it is a Lambo), the two-euro cars do not. This is an ongoing issue with data validation in user-facing web forms.

While this could be a case for further investigation, I will just cut left and right and basically say that only cars priced from 500 to 30.000 euros are retained. It is a deliberate decision, and the beauty of Jupyter notebooks is that it could easily be changed or, even better, parametrized in a function.

raw_data = raw_data[raw_data.price.between(500,30000)]

After this initial filter there may still be gibberish in the other columns, but for now we are dealing only with the price variable. The next thing to check is the shape of the distribution via a histogram. This Kaggle notebook is incredibly useful and thorough, and I have stolen more than one trick from it. One of those is the following way to plot the price histogram.

sns.distplot(raw_data.price, fit=norm)  # histogram with a fitted normal curve
(mu, sigma) = norm.fit(raw_data['price'])  # parameters of the fitted normal

plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
           loc='best')
plt.ylabel('Frequency')
plt.title('Price distribution')

# Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(raw_data['price'], plot=plt)
plt.show()

Price distribution

Price QQ-plot

Skewness and kurtosis

The aforementioned Kaggle notebook states: "The target variable is right skewed. As (linear) models love normally distributed data, we need to transform this variable and make it more normally distributed."

This is something I want to leave for later, since I want to save the data without this transformation first.
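For the record, both statistics are one call away in pandas. A quick sketch:

print("Skewness: {:.2f}".format(raw_data['price'].skew()))
print("Kurtosis: {:.2f}".format(raw_data['price'].kurt()))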

Year variable

Plotting a histogram of the year variable yields:

Year Histogram

I am applying the same type of filter, retaining cars made between 1985 and 2020, inclusive.
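A minimal sketch of that filter, assuming the same between approach used for the price:

raw_data = raw_data[raw_data.year.between(1985, 2020)]  # between() is inclusive on both ends

After that transformation, this is the result: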

Year Histogram After Cutoff

We could say that this is a bimodal distribution: we can see two spikes - around 2006 and then around 2015. Logically this makes sense: in the majority of cases, cars are either sold in the first 4-5 years after being acquired, once the warranty period ends, or when they begin to break down (after some 15 years). Unfortunately, my 2006 Stilo falls into the second category - the big spike.

Kilometers / odometer / mileage

This is probably the most controversial and most faked variable in this dataset. Balkan countries, especially those which aren't (yet) members of the EU, are known for having hundreds if not thousands of cars with tampered odometers. It's one of those things - everybody knows it, the buyers know it, the sellers know it and the authorities know it, but nobody is doing anything about it.

At its most basic level of observation, the distribution of mileages should be normal - it should follow the age distribution. That is, however, a fictitious world - a world in which every car gets driven approximately the same way, a world filled with honest resellers and trustworthy owners…

I won't show the initial histogram: it includes some values that exceed 3 million km. Having to draw a line, I decided that my line was a 3000 km minimum and a 450.000 km maximum.
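The filter is analogous to the previous ones. A sketch, assuming the odometer column is named km:

raw_data = raw_data[raw_data.km.between(3000, 450000)]  # keep plausible odometer readings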

The resulting histogram is the following:

Km Histogram

I should say that I counted the data points lost at each step, and these are all data points that could probably have been saved or repaired. We are bleeding data points here and I am well aware of it. All of these filters could, and probably should, be defined in a more granular and precise way. I am going to keep them for now, but this is like a second checkpoint - a deliberate choice that could potentially affect the results.

I replaced the year variable with a simpler and more intuitive age variable:

raw_data['age'] = 2020 - raw_data['year']  # the data was scraped in 2020
raw_data = raw_data.drop(['year'], axis=1)

kW and cm3

These two variables - both numerical - are highly correlated: they define the power and the displacement of the engine. I could probably keep just one of them, but I was hoping that their ratio could give me some additional insight into the cars dataset.

Newer car models generally manage to squeeze more power out of the same engine displacement (in cm3), so a bigger kW/cm3 ratio should theoretically indicate newer cars…
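A sketch of that ratio feature, assuming the two columns are named kw and cm3:

raw_data['kw_per_cm3'] = raw_data['kw'] / raw_data['cm3']  # hypothetical power-to-displacement feature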

The two variables are, however, almost completely correlated and should probably be replaced with just one of them.

As usual, I defined a couple of cutoff values and applied them to both columns. This is, again, totally arbitrary… I just took something that I consider to be a really big car and checked its engine to see what could be a reasonable maximum; for the minimum, I settled on a Yugo. An Audi A8 quattro has 338 kW and almost 4000 cm3.

For the Yugo, 33 kW and 900 cm3 seem to be the specs. Anyway, I took these two models as the two ends of my car spectrum.
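A sketch of those cutoffs, again assuming kw and cm3 as the column names:

raw_data = raw_data[raw_data.kw.between(33, 338) & raw_data.cm3.between(900, 4000)]

The resulting histograms are these: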

KW Histogram

Cubic Centimeters Histogram

This second plot makes perfect sense: again we have a bimodal distribution, with the vast majority of cars using engines of approximately 1600 cm3 (1.6 litres) or 1900-2000 cm3 - these two groups give us the spikes.

Scatter plots for numerical data

Next, I plotted some scatter plots of the numerical data, hoping to find some nice visual correlations, similar to the famous R cars dataset.
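Each of these plots boils down to a single seaborn call. A sketch, assuming the fuel type column is named fuel:

sns.scatterplot(data=raw_data, x='age', y='price', hue='fuel')  # color the points by fuel type
plt.show()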

A plot of Age vs Price, with the colors assigned to the fuel type, gives us this:

Scatter Plot: Price/Age/Fuel

As expected: the price falls pretty fast with the car age. We could further analyze this - it would be nice to compare how different car brands become cheaper with time… And is it true that Volkswagens are champions?

Plotting the same thing, just using the odometer on the x-axis, yields:

Scatter Plot: Price/Km/Fuel

Diesels dominate the Serbian market, LPG-enabled cars are cheap, the priciest cars are diesel-powered… These are some random conclusions that can be drawn from the plots.

Finally, plotting kms vs age, gives us this beauty:

Scatter Plot: Age/Km/Fuel

What does this tell us? Well, pretty much the same story: diesels rack up more kilometers and they dominate the left area of the chart (younger cars). The far right part is sparse and less significant - cars older than twenty years tend to have a more uniform distribution.

What is visible from these charts is the fact that the country obviously doesn't subsidize LPG cars as other countries do - there are very few green dots in the left part of the chart.

Finally, I did a correlation heatmap with some nice Seaborn colors.
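A minimal sketch of such a heatmap (the palette here is my guess, not necessarily the one from the chart):

corr = raw_data.corr()  # pairwise correlations of the numerical columns
sns.heatmap(corr, annot=True, fmt='.2f', cmap='GnBu')
plt.show()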

Correlation Heatmap

I often find correlation heatmaps confusing, but here we do not have too many moving parts. Looking at the price column (or row), km and age are bright green, as they should be, while kW is pretty dark (blue) as it is positively correlated - and the same goes for the standard column…

Categorical variables

Having dealt with the numerical variables, let's move on to the categorical ones - variables that are binary or can take a discrete, finite number of values.

gearbox

In Serbia we mainly drive manual, but there are some automatic cars, especially in the upper tier. I won't go into the details; I will just say that over 5500 cars have a manual gearbox. Pricewise, this is the dependence that I found:
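The numbers behind the following chart come from a simple groupby. A quick sketch:

print(raw_data['gearbox'].value_counts())            # how many cars per gearbox type
print(raw_data.groupby('gearbox')['price'].mean())   # mean price per gearbox type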

Mean Price per Gearbox Type

I repeat: although this looks wonderfully linear and in order, the number of non-manual cars is almost negligible, so this is not as valuable a piece of information as it might seem…

Answering some questions

It would be silly to pass up such a great opportunity to query the dataset and try to obtain some potentially interesting answers (a sketch of the corresponding queries follows the list):

  • who is the most popular manufacturer?
  • what about the models?
  • petrol or diesel?
  • average mileage per car type etc.?
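Each of these is a one-liner. A hedged sketch - brand, make, fuel and car_type are the column names used later in this notebook, km is my guess for the odometer:

print(raw_data['brand'].value_counts().head(10))   # most popular manufacturers
print(raw_data['make'].value_counts().head(10))    # most popular models
print(raw_data['fuel'].value_counts())             # petrol or diesel?
print(raw_data.groupby('car_type')['km'].mean())   # average mileage per car type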

No surprises here…

Top 10 Most Popular Brands in Serbia 2020

This is what we already established in the previous notebook - VW is the king, Opel is the prince, followed by two French car producers, with my Fiat placing fifth. Let's see the models:

Top 10 Most Popular Makes in Serbia 2020

Well, keep in mind that this is just around 260 Astras out of 6000 cars… But still, Astras are pretty popular. This is, however, of very little significance, as the list is dominated by models that span several decades: Astra, Passat, Focus, Corsa, Clio… The VW Golf is indubitably a more popular model than all the others - in fact, there are 322 Golfs between the Golf 1 and… I don't know, Golf 7, 8? The thing is that they are listed as separate models, while all the Astras are… just Astras. We need to pay special attention to this if we want to make a precise and trustworthy report!

Just to make a point, let’s take a look at the top FIAT models:

Top 10 Most Popular FIAT models in Serbia 2020

This also makes sense: the Punto is a very old model with a reputation of being cheap to maintain and reasonably reliable. It was made in Serbia for a couple of years and there are still a lot of them on the roads. Bravo and Croma are "inflated" since each includes two completely separate Fiat models (one from the 90s and one from the 2000s), and the new Tipo just hasn't made it to the ads yet, but it will - it has been quite popular over the last two years.

Let’s see the average prices divided by brands:

Most expensive Brands

Jeep, Infiniti… I should have made a filter to include only brands with at least 50 or 100 rows…
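That filter would only take a few lines. A sketch:

counts = raw_data['brand'].value_counts()
common = counts[counts >= 50].index  # keep brands with at least 50 ads
print(raw_data[raw_data['brand'].isin(common)]
      .groupby('brand')['price'].mean()
      .sort_values(ascending=False))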

Least expensive Brands

The least expensive cars are, as expected, the domestic and defunct Yugo and Zastava (with the older Zastava still holding a better price!), followed by Daewoo - the brand does not exist anymore, so all these cars are old. Another perk of making such an analysis… If we take the same approach but include only cars younger than 10 years, we get completely different, but much more relevant, results:

Least expensive Brands (under 10yrs)

Cherry, Tata, Lada, Dacia… now we’re talking.

Let’s take a look at the car age / price distribution.
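Both of the following charts are again a one-line groupby away (km is, as before, my guess for the odometer column):

raw_data.groupby('age')['price'].mean().plot()  # average price by age
plt.show()
raw_data.groupby('age')['km'].mean().plot()     # average km by age
plt.show()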

Average price by age

Too smooth to be true… It makes sense, though, with the spike at the age of two: people who change cars frequently sell two-year-old cars, which happen to be more expensive, I guess?

Km by age

This is where the lying takes place. It just doesn't make sense that the line doesn't grow continuously; it seems to fluctuate around the magic psychological threshold of 170-180 thousand km…

Additional column - country

I believe that there is implicit information hidden in the brand's country of origin. Some people simply prefer German cars, there is a pretty strong French cluster - people who worship Peugeot/Citroen/Renault vehicles - and there might be a tiny Italian niche as well. I will add a new variable denoted country and then one-hot encode it with the others.

I decided to make a couple of groups:

  • german
  • italian
  • french
  • asian
  • USA
  • EU - for all the Spanish, Czech, Romanian, British (?) and Scandinavian cars
  • YUSSR - Yugoslavian and Russian cars

german = ['vw','opel','audi','bmw','mercedes','mini']
italian = ['fiat','alfa romeo','lancia']
french = ['peugeot','renault','citroen']
asian = ['toyota','nissan','mazda','hyundai','kia','mitsubishi','daewoo','daihatsu','honda','subaru','suzuki']
usa = ['ford','dodge','chrysler','chevrolet']
eu = ['škoda','seat','dacia','jaguar','rover','smart','saab']
yussr = ['yugo','zastava','lada']

def get_country(item):
    # map a brand name to its country-of-origin group
    if item.lower() in german:
        return 'German'
    if item.lower() in italian:
        return 'Italian'
    if item.lower() in french:
        return 'French'
    if item.lower() in asian:
        return 'Asian'
    if item.lower() in usa:
        return 'USA'
    if item.lower() in eu:
        return 'EU'
    if item.lower() in yussr:
        return 'YuSSR'
    return "Other"
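Applying the function is then a single line (assuming the manufacturer column is named brand, as in the categoricals list below):

raw_data['country'] = raw_data['brand'].apply(get_country)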

What I got was this:

Car by Origin Group Count

Again, lots of German cars, some French cars and then the smaller clusters.

Two files - with and without encodings

I decided to make two files here: one without encoding the categorical variables - for the CatBoost algorithm (it gets its name from the fact that it handles categorical variables natively) - and one with one-hot encoding of the variables. You know, when we get columns with names like Brand_Fiat, Brand_Volvo and so on, which can take only zeros and ones as values.

raw_data.to_csv('data_with_categories.csv',index=False)

raw_data = pd.concat([raw_data,
                      pd.get_dummies(raw_data['gearbox'], prefix='gearbox', drop_first=True)],
                     axis=1).drop(['gearbox'], axis=1)

We then do the same for all the categorical variables:

categoricals = ['brand','make','doors', 'imported','fuel', 'color', 'aircon', 'car_type', 'drive','country']
for cat in categoricals:
    print("One hot encoding category: ", cat)
    raw_data = pd.concat([raw_data, pd.get_dummies(raw_data[cat],prefix=cat, drop_first=True)], axis=1).drop([cat], axis=1)

This procedure gives us a dataframe with 587 columns!

The one-hot-encoded data can be further processed by applying a log1p transformation to the output variable (or a Box-Cox transform), so I ended up with two files: one with the output transformed and another one with the original variable.
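A sketch of that final step (the file names are placeholders, not necessarily the ones I used):

raw_data.to_csv('data_one_hot.csv', index=False)      # untransformed target
raw_data['price'] = np.log1p(raw_data['price'])       # log(1 + price)
raw_data.to_csv('data_one_hot_log.csv', index=False)  # log-transformed target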

Car Price after applying log1p

Much nicer and more normal-like, we can agree on that.

That’s it for this one. We ended up with three files, data has been cleaned and filtered, transformations applied and we got to know our data a bit better. Next stop: linear models!