Expedia Kaggle data mining competition



  1. The procedure of data mining
  2. How to carefully analyze the provided data
  3. Mean Average Precision: the evaluation method
  4. Conditions under which we cannot simply apply machine learning:
    • There are millions of rows, which increases runtime and memory usage for algorithms
    • There are 100 different clusters, and according to the competition admins, the boundaries are fairly fuzzy, so it will likely be hard to make predictions. As the number of clusters increases, classifiers generally decrease in accuracy.
    • Nothing is linearly correlated with the target (hotel_cluster), meaning we can’t use fast machine learning techniques like linear regression.
  5. How to create features ourselves
  6. How to check the correlation between the features and the label
  7. Use PCA or downsampling when the number of rows or features is very large
  8. How to flexibly use Pandas
  9. The cross-validation data sometimes needs to be prepared manually, and its format needs to match that of the test data
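
To make the evaluation method in item 3 concrete, here is a minimal sketch of Mean Average Precision. It assumes the competition's setup of up to five ranked guesses per row (MAP@5) and exactly one correct cluster per row, in which case the average precision of a row reduces to 1/rank of the correct guess. The function names are my own, not part of any library.

```python
from typing import Hashable, Sequence


def average_precision_at_k(actual: Hashable, predicted: Sequence[Hashable], k: int = 5) -> float:
    """AP@k for a row with exactly one correct label.

    Returns 1/rank of the first correct guess within the top k, or 0.0 if
    the correct label is not among the top k guesses.
    """
    for rank, guess in enumerate(predicted[:k], start=1):
        if guess == actual:
            return 1.0 / rank
    return 0.0


def mean_average_precision_at_k(
    actuals: Sequence[Hashable],
    predictions: Sequence[Sequence[Hashable]],
    k: int = 5,
) -> float:
    """Mean of the per-row AP@k scores."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    if not actuals:
        return 0.0
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores)


# Toy data (made up for illustration): three rows, five ranked guesses each.
actuals = [3, 7, 42]
predictions = [
    [3, 1, 2, 4, 5],  # correct at rank 1 -> 1.0
    [1, 7, 2, 4, 5],  # correct at rank 2 -> 0.5
    [1, 2, 3, 4, 5],  # correct label missing -> 0.0
]
score = mean_average_precision_at_k(actuals, predictions)
print(score)
```

Because the score depends on rank, it pays to order the guesses from most to least likely rather than just getting the right cluster somewhere in the list.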



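The correlation check from item 6 can be sketched with Pandas as follows. This is only an illustration on synthetic data: the feature columns (feature_a, feature_b, engineered_feature) are hypothetical, and the target column is assumed to be named hotel_cluster. Note that DataFrame.corr() computes Pearson correlation by default, so it only detects linear relationships.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the training data; all values are randomly generated.
rng = np.random.default_rng(0)
n_rows = 1000
toy = pd.DataFrame(
    {
        "hotel_cluster": rng.integers(0, 100, size=n_rows),
        "feature_a": rng.integers(1, 5, size=n_rows),  # unrelated to the target
        "feature_b": rng.integers(0, 2, size=n_rows),  # unrelated to the target
    }
)
# A hypothetical feature deliberately built from the target, for contrast.
toy["engineered_feature"] = toy["hotel_cluster"] * 2 + rng.normal(0, 5, size=n_rows)

# Pearson correlation of every feature with the target, strongest first.
correlations = (
    toy.corr()["hotel_cluster"]
    .drop("hotel_cluster")
    .abs()
    .sort_values(ascending=False)
)
print(correlations)
```

Values close to zero for every real feature are what motivates the remark above that fast linear models are a poor fit for this data.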