Cleaning Cars... data (Cars: 2)

Published on

This is the second part of an ongoing series of posts on a project. The project consists in the gathering of data about used cars, analysis and modelling of the price as the target variable.


In this post I will try to explain the logic behind the first part of the data cleaning process. In the previous post link I explained my quick-and-dirty process of scraping the car adverts into MongoDB. Now that they are safe in the document store, I can make a query and get them out (I’ve made a new read-only user for this purpose) with something like this:

from pymongo import MongoClient
mongoURI = "mongodb+srv://"
client = MongoClient(mongoURI)

db = client.carAnalysis
carsCollection =

start = datetime(2018, 3, 1, 7, 51, 4)
end = datetime(2020, 3, 6, 7, 52, 4)

new_cars = carsCollection.find({'timeParsed': {'$lt': end, '$gt': start}})

The heavy-lifying and really boring part is putting the MongoDB data into a pandas dataframe, it can take quite a while:

df = pd.DataFrame(list(carsCollection.find({'timeParsed': {'$lt': end, '$gt': start}})))

After that, I wrote a bunch of one-off functions that basically do some simple text processing, getting rid of some words and properly formatting the features, returning the right types, floats or integers and so on:

def get_km(items):
    for item in items:
        if 'prešao kilometara' in item.lower():
            return int(item.split()[-1])


One of the semi-arbitrary features that I used is the color of the car. After some extraction, I was left with this:

siva                       2287
crna                       1452
bela                        899
plava                       800
srebrna                     360
crvena                      352
teget                       191
zelena                      152
braon                       128
bordo                       102
svetlosiva                   74
zlatna                       59
bež                          39
žuta                         35
kameleon                     29
ljubičasta                   24
svetlozelena                 23
narandžasta                  20
tirkiz                       16
ostalo                       16
unutrašnjosticrna            11
višnja                       11
unutrašnjostisiva             7
krem                          6
smb                           5
unutrašnjostibež              3
limunžuta                     2
unutrašnjostisvetlosiva       1
unutrašnjostibordo            1
unutrašnjostiplava            1
unutrašnjostiostalo           1
Name: color, dtype: int64

Afterwards I made some binning and grouping and I came up with just a small number of colors, as I believe other analyst did. It makes sense to me, especially for luxurious or business cars which are mostly gray, black or white…

Colors histogram

Basically 4 pretty even groups: white, black, gray and various.

Another feature was weather the car was imported or not, or basically if it was sold unregistered, directly from a truck that came from Germany or Italy, or it was driven in Serbia.

Air condition

After processing the air condition types, I was left with this:

automatic    3097
manual      2057
"has airocon"      1387
no aircon    696
--no data --    170

I supposed that no data and no aircon meant the same thing. Anyway, I ended up giving a score for the aircon feature, ranging from 0 - no aircon, to 4 - automatic, with 1 being no data:

Medium Pice vs Air Conditioning

This doesn’t make many sense, but maybe the really pricey car do not need to emphasize the fact that they have an automatic aircon?.

There were very few damaged cars, but I kept them. It is a variable that really depreciates the car, whatever the damage might be and people tend to mention only if the car was hit by an airplane or something like that, so I don’t know how informative it really is… Only 34 cars out of 7407…


We’re getting into the really important and decisive factors. The price. I wanted to make a sanity check as soon as the data was loaded, so I printed a simple histogram, filtering the data using the minimum of 2.000 euros and a maximum of 30.000 euros :


Prices histogram

It is heavy right-tailed as it should be.

Mileage aka kilometers

The most controversial and important variable - how many kilometers has the car actually driven?

Km histogram

In the Balkans, this is the million dollar question… Everybody cheats, so nobody’s cheating, right? It seems that the magic number is around 160.000km, a kind of a psychological barrier… I’ll get into that a bit later, for now let’s just say that the histogram looks legit.

Car types

I don’t know, maybe car type isn’t the right term, but who cares. What I mean is: sedane, limo, SUV… that kind of stuff.

What I got was this:

Hatchback  2327
Caravan 1768
Limo 1416
Minibus/van 705
SUV 572
Small vehicle 164
Ppickup 160
Coupe 152
Cabrio 59
Diesel  56
Benzin 24

The last two didn’t get the memo right, but the rest is pretty OK and represents the car population pretty accurately as far as I, as a driver, can tell.

The number of doors is also considered, and then I got to see which are the most popular car makers in Serbia. No surrises here, there are numerous jokes about the Yugoslav’s love and devotion for the mythical VolksWagen, and the data doesn’t lie:

VW             986
Opel           814
Renault        658
Peugeot        568
Ford           509
Fiat           501
Audi           466
BMW            376
Citroen        373
Mercedes       352
Škoda          333
Seat           148
Nissan         123

Car standard

One thing you can’t lie about your car is the standard - Euro 2, Euro 5 etc. And the data, again, backs it up, revealing a pretty old car structure in the country, reflecting the low purchasing power of the population.

Tip_motora_Euro_4            2657
Tip_motora_Euro_3            1743
Tip_motora_Euro_5            1372
Tip_motora_Euro_6             456
Tip_motora_Euro_2             263
Tip_motora_Euro_1              61
Tip_motora_Drugo               44

Sadly, mostly Euro 3. Not much to add here, I guess.

Drive - front wheel, back of 4 wheel drive?

Turns out this is preeeety important and, unfortunately, there were a lot of missing values: 1247 to be precise. I believe I was able to solve this problem by making a utility function that would basically take all the cars of the same brand and make and then cast a majority vote on the drive. If there aren’t any other similar cars, I return simply front wheel drive since it is the majority overall. While not infallable, I believe this method rarely makes mistakes.

def infer_drive(row):
    if row.make:
        make = row.make
            res = str(dirty_data[dirty_data.make==make].drive.value_counts().index[0])
        except IndexError:
            res = 'Prednji'        
        return res

Finally, I ended up with a dataframe with the following columns:

  • year
  • price
  • km
  • imported (yes/no)
  • kW
  • cm3
  • registered (yes/no)
  • aircon
  • damage (yes/no)
  • standard

I saved the data to a comma separated file and… I’ll see you in the next post link.