Exploring Udemy Courses
Udemy is one of the leading online teaching platforms. It offers a wide variety of courses, including Web Development, Data Science, Music, and Design.
Recently I found a dataset of Udemy courses on Kaggle. Let’s apply some EDA (exploratory data analysis) and predictive modelling to it.
EDA
An important feature of this dataset is is_paid, since the price of a course has a real impact on users.
The features available in this dataset are:
['course_id', 'course_title', 'url', 'is_paid', 'price',
'num_subscribers', 'num_reviews', 'num_lectures', 'level',
'content_duration', 'published_timestamp', 'subject',
'content_time_value', 'content_time_unit',
'content_multiplier',
'engagement']
The dataset contains 3678 courses, which is a large number.
Among them, 3368 are paid courses.
data_paid = data[data['is_paid'] == True]
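A quick way to check counts like these is sketched below. The tiny DataFrame is an invented stand-in for the Kaggle file, and the sketch assumes is_paid is stored as a boolean; if it is loaded as text, it would need converting first.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration).
data = pd.DataFrame({
    "course_id": [1, 2, 3, 4],
    "is_paid": [True, True, False, True],
    "price": [20, 50, 0, 100],
})

total_courses = len(data)                      # one row per course
paid_counts = data["is_paid"].value_counts()   # how many True / False rows

data_paid = data[data["is_paid"]]              # boolean mask, same rows as == True
data_free = data[~data["is_paid"]]             # the complement: free courses
```

On the real data, total_courses and len(data_paid) should correspond to the two figures quoted above.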
Most users choose courses by popularity, i.e., courses with a larger number of subscribers.
The course with the highest number of subscribers can be obtained with the following code:
data_paid[data_paid['num_subscribers']==max(data_paid['num_subscribers'])]
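The same lookup can also be written with idxmax, which returns the index label of the first maximum. This is only an alternative sketch on invented data; unlike the filter above, it returns a single row even if several courses tie for first place.

```python
import pandas as pd

# Invented rows standing in for the paid-course subset.
data_paid = pd.DataFrame({
    "course_title": ["Course A", "Course B", "Course C"],
    "num_subscribers": [120, 900, 450],
})

# Index label of the row with the largest subscriber count.
top_idx = data_paid["num_subscribers"].idxmax()

# Full record of that course.
top_course = data_paid.loc[top_idx]
```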
The top 10 most popular courses are:
# Sort by subscribers (descending), keep the first 10 rows, and renumber them from 0
data_paid_10 = (
    data_paid.sort_values(by='num_subscribers', ascending=False)[0:10]
    .reset_index(drop=True)[['course_id', 'course_title', 'num_subscribers', 'num_reviews', 'price']]
)
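For a top-N selection like this, pandas also provides nlargest, which avoids sorting and slicing by hand. A minimal sketch on invented data, using N = 3 so the result is easy to read:

```python
import pandas as pd

# Invented rows standing in for the paid-course subset.
data_paid = pd.DataFrame({
    "course_id": [11, 12, 13, 14, 15],
    "course_title": ["A", "B", "C", "D", "E"],
    "num_subscribers": [120, 900, 450, 300, 50],
    "num_reviews": [10, 80, 40, 25, 3],
    "price": [20, 200, 50, 95, 20],
})

cols = ["course_id", "course_title", "num_subscribers", "num_reviews", "price"]

# Three rows with the most subscribers, already in descending order.
top3 = data_paid.nlargest(3, "num_subscribers")[cols].reset_index(drop=True)
```

Replacing 3 with 10 gives the same kind of table as the sort-and-slice version.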

Popular paid:

Popular free:

Price vs subscribers:

The free course with the highest number of subscribers is “Learn HTML5 Programming From Scratch”.
The number of lectures is also important, as users expect a good number of lectures for their money.

Number of courses in each context:
paid courses:

free courses:

The course with the highest number of lectures is “Back to School Web Development and Programming”.
Number of subscribers vs Number of reviews

Distribution of price in “Business courses”:

Distribution of price in “Development courses”:

Number of courses at each difficulty level:

There is no correlation among these features.
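A check of this kind is usually done with a correlation matrix. The sketch below uses invented numbers and an assumed choice of numeric columns, so its values say nothing about the real dataset; it only shows the mechanics.

```python
import pandas as pd

# Invented values; only the column names come from the dataset.
data = pd.DataFrame({
    "price": [20, 50, 0, 100, 75],
    "num_subscribers": [1000, 300, 5000, 150, 800],
    "num_lectures": [10, 35, 8, 60, 22],
})

numeric_cols = ["price", "num_subscribers", "num_lectures"]

# Pairwise Pearson correlation (the pandas default), values in [-1, 1].
corr = data[numeric_cols].corr()
```

Entries close to 0 indicate little linear relationship between two features; entries close to 1 or -1 indicate a strong one.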

Predictive Modelling
Let’s try to estimate the number of subscribers for a newly uploaded course using the available data.
Drop the features that, by reasoning, should not be used as predictors:
['course_id','course_title','url','num_reviews','published_timestamp','engagement','content_multiplier']
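The preparation step could look roughly like the sketch below. The rows are invented and contain only a subset of the columns. One-hot encoding of level and subject is my assumption for illustration; the original analysis may have encoded them differently.

```python
import pandas as pd

# Invented rows with a subset of the dataset's columns.
data = pd.DataFrame({
    "course_id": [1, 2, 3],
    "course_title": ["A", "B", "C"],
    "url": ["u1", "u2", "u3"],
    "is_paid": [True, False, True],
    "price": [50, 0, 120],
    "num_subscribers": [300, 5000, 150],
    "num_reviews": [20, 400, 9],
    "num_lectures": [35, 8, 60],
    "level": ["All Levels", "Beginner Level", "All Levels"],
    "published_timestamp": ["2017-01-01", "2016-05-02", "2015-09-03"],
    "subject": ["Web Development", "Web Development", "Business Finance"],
    "content_multiplier": [60, 60, 60],
    "engagement": [0.07, 0.08, 0.06],
})

drop_cols = ["course_id", "course_title", "url", "num_reviews",
             "published_timestamp", "engagement", "content_multiplier"]

features = data.drop(columns=drop_cols)

# Target: the value we want to predict.
y = features.pop("num_subscribers")

# Assumed encoding: turn the text columns into indicator columns.
X = pd.get_dummies(features, columns=["level", "subject"])
```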
I used a Random Forest Regressor and an XGBoost Regressor.
The MAE (mean absolute error) is 3477 for the Random Forest Regressor and 3218 for the XGBoost Regressor.
This error is large. One way to reduce it is to include the course description as a feature. Adding other useful features may also help.
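The modelling step might look roughly like the sketch below. It uses synthetic data, an assumed 75/25 train/test split, and default-like hyperparameters, so its MAE will not match the figures above. Only the Random Forest is shown; xgboost's XGBRegressor follows the same fit/predict interface. The sketch also reads the fitted forest's feature_importances_, which is the usual source for feature importance plots.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two features (think price and num_lectures) and a noisy target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 200, size=(200, 2))
y = 5000 - 10 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 100, size=200)

# Hold out 25% of the rows to measure error on unseen data (assumed split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# MAE: average absolute difference between predicted and true values.
mae = mean_absolute_error(y_test, model.predict(X_test))

# One non-negative score per feature; the scores sum to 1.
importances = model.feature_importances_
```

Because MAE is in the same unit as the target, an MAE of a few thousand means the predicted subscriber count is off by a few thousand on average.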
Feature importance from Random Forest:

Feature importance from XGBoost:

Conclusion
We have seen some good observations from this dataset. The number of lectures and the price have the most impact on users.
Complete code can be accessed from here.