You can download the dataset from Kaggle; here is the link.
I have focused heavily on text pre-processing in my earlier posts, and here I would like to summarize all the important steps in one place. I will not go into the theoretical background; for that, please refer to my earlier posts, where I explain in detail what I did and why.
The objective of the competition is to implement different ML algorithms on the DonorsChoose dataset and analyze their performance on the test dataset.
Importing all the necessary libraries to perform the analysis.
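Below is a minimal set of imports that should cover everything used in the cells that follow (pandas/NumPy, the scikit-learn vectorizers and models, SciPy's sparse stacking, and matplotlib for plots); depending on which cells you run, you may need a few more.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import Normalizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from scipy.sparse import hstack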
data = pd.read_csv('preprocessed_data-Copy1.csv')
print(data.shape)
print(data["project_is_approved"].value_counts(normalize=True))
data.describe(include=["object", "bool"])
y = data['project_is_approved'].values
print(y)
X = data.drop(['project_is_approved'], axis=1)  # dropping the target column, since it is what we want to predict
X.shape
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y,random_state=42)
vectorizer = CountVectorizer(min_df=10, ngram_range=(1,4), max_features=7000)
text_bow = vectorizer.fit(X_train['essay'].values)
print(text_bow.get_feature_names())
X_train_essay = vectorizer.transform(X_train['essay'].values)
X_test_essay = vectorizer.transform(X_test['essay'].values)
tfidf_vector = TfidfVectorizer(min_df=10, max_features=7000)
Text = tfidf_vector.fit(X_train['essay'].values)
print(Text.get_feature_names())
X_train_essay_TFIDF = tfidf_vector.transform(X_train['essay'].values)
X_test_essay_TFIDF = tfidf_vector.transform(X_test['essay'].values)
print(X_train_essay_TFIDF.shape, y_train.shape)
print(X_test_essay_TFIDF.shape, y_test.shape)
Encoding the categorical and numerical features
vectorizer_state = CountVectorizer()
vectorizer_state.fit(X_train['school_state'].values)  # fit has to happen only on train data
X_train_state = vectorizer_state.transform(X_train['school_state'].values)
X_test_state = vectorizer_state.transform(X_test['school_state'].values)
normalizer = Normalizer()
normalizer.fit(X_train['price'].values.reshape(1,-1))  # fitting on the train data only
X_train_price = normalizer.transform(X_train['price'].values.reshape(1,-1)).reshape(-1,1)
X_test_price = normalizer.transform(X_test['price'].values.reshape(1,-1)).reshape(-1,1)
Concatenating all the features
from scipy.sparse import hstack
X_tr = hstack((X_train_essay, X_train_state, X_train_price)).tocsr()  # essay BOW + all other encoded features
X_te = hstack((X_test_essay, X_test_state, X_test_price)).tocsr()
Applying Naive Bayes
naive_bayes_1 = MultinomialNB(class_prior=[0.5, 0.5])
parameter_1 = {'alpha': list(np.arange(0.001, 100, 2))}
print(parameter_1)
classifier_1 = GridSearchCV(naive_bayes_1, parameter_1, scoring='roc_auc', cv=10, return_train_score=True)
classifier_1.fit(X_tr, y_train)
train_auc_1 = classifier_1.cv_results_['mean_train_score']
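Once the grid search finishes, we can read the best alpha from best_params_ and check the AUC on the held-out test data. A minimal sketch (the variable names follow the cells above):

best_alpha = classifier_1.best_params_['alpha']
naive_bayes_best = MultinomialNB(alpha=best_alpha, class_prior=[0.5, 0.5])
naive_bayes_best.fit(X_tr, y_train)
y_test_pred = naive_bayes_best.predict_proba(X_te)[:, 1]  # probability of approval
print("Test AUC:", roc_auc_score(y_test, y_test_pred))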
A word cloud is a text-data visualization in which the most frequently occurring words are displayed in the largest font sizes. In this section we build a custom word cloud in Python.
from wordcloud import WordCloud
from wordcloud import STOPWORDS
# convert the list of words to a single string and generate the cloud
words_string = (" ").join(fp_words)
wordcloud = WordCloud(width=1000, height=500).generate(words_string)
plt.figure(figsize=(25,10))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
A decision tree is a widely used, non-parametric machine learning model for classification and regression problems. We use decision trees to classify the data, as explored in the following sections.
The preprocessing is the same for every model; we simply apply different models to see which one performs best on this dataset. Let's apply a decision tree to the preprocessed data, but first let's calculate sentiment scores.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # needed once for the VADER lexicon
sid = SentimentIntensityAnalyzer()
sample_sentence_1 = 'I am happy.'
ss_1 = sid.polarity_scores(sample_sentence_1)
print('sentiment score for sentence 1', ss_1)
Output: sentiment score for sentence 1 {'neg': 0.0, 'neu': 0.213, 'pos': 0.787, 'compound': 0.5719}
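To use these scores as features, the analyzer can be run over every essay to produce four numeric columns (neg, neu, pos, compound) and stacked with the earlier features. The sketch below shows one way to build the X_train1 / X_test1 matrices used in the next cells; it is an assumption about how they were constructed, not necessarily the exact code from the original notebook:

def sentiment_features(essays):
    # one row of [neg, neu, pos, compound] per essay
    scores = [sid.polarity_scores(essay) for essay in essays]
    return np.array([[s['neg'], s['neu'], s['pos'], s['compound']] for s in scores])

train_sentiment = sentiment_features(X_train['essay'].values)
test_sentiment = sentiment_features(X_test['essay'].values)
X_train1 = hstack((X_tr, train_sentiment)).tocsr()
X_test1 = hstack((X_te, test_sentiment)).tocsr()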
param_grid = {"max_depth": [1, 5, 10, 50], "min_samples_split": [5, 10, 100, 500]}
model = DecisionTreeClassifier()
clf = GridSearchCV(model, param_grid, cv=3, scoring='roc_auc', return_train_score=True)
clf.fit(X_train1, y_train)
You can refer to this link to know more about the heatmap and confusion matrix.
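As a quick illustration, here is one way to plot the decision tree's confusion matrix as a heatmap (a sketch that assumes seaborn is installed and reuses the fitted GridSearchCV and the X_test1 matrix from above):

import seaborn as sns
from sklearn.metrics import confusion_matrix

y_test_pred_dt = clf.predict(X_test1)  # predictions from the best estimator found by the grid search
cm = confusion_matrix(y_test, y_test_pred_dt)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['not approved', 'approved'],
            yticklabels=['not approved', 'approved'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()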
Again, the preprocessing stays the same; next, let's apply a Gradient Boosted Decision Tree (GBDT) classifier to the same features to see whether it performs better.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
parameters = {"max_depth": [1, 2, 3, 4], "min_samples_split": [5, 10, 15, 20]}
clf = GridSearchCV(GradientBoostingClassifier(), parameters, cv=5, scoring='roc_auc', return_train_score=True, n_jobs=-1)
clf.fit(X_train1, y_train)
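The AUC values reported in the summary table below can be computed the same way as for the earlier models; a short sketch reusing the fitted grid search:

print(clf.best_params_)  # e.g. {'max_depth': ..., 'min_samples_split': ...}
y_test_pred_gbdt = clf.predict_proba(X_test1)[:, 1]
print("GBDT Test AUC:", roc_auc_score(y_test, y_test_pred_gbdt))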
To summarize the results from the different models, we can use PrettyTable. For example:
from prettytable import PrettyTable
pretty_table = PrettyTable()
pretty_table.field_names = ["Vectorizer", "Model", "max_depth", "min_samples_split", "AUC"]
pretty_table.add_row(["Tfidf", "classifier_name", "4", "20", "0.68"])
pretty_table.add_row(["Tfidf_w2v", "classifier_name", "10", "500", "0.59"])
print(pretty_table)