Confidence and predict_proba
- What does it mean to be “confident” in your results?
- When you perform analysis, you are responsible for many judgment calls.
- Your results may differ from other people's.
- As you make these judgments and start to form conclusions, how can you recognize your own uncertainties about the data so that you can communicate confidently?
Let’s imagine that the following claim is true:
Vancouver has the highest cost of living of all cities in Canada.
Now let’s consider a few beliefs we could hold:
- Vancouver has the highest cost of living of all cities in Canada. I am 95% sure of this.
- Vancouver has the highest cost of living of all cities in Canada. I am 55% sure of this.
The stated level of certainty (95% or 55%) is called a credence. Which belief is better?
But what if it’s actually Toronto that has the highest cost of living in Canada?
- Vancouver has the highest cost of living of all cities in Canada. I am 95% sure of this.
- Vancouver has the highest cost of living of all cities in Canada. I am 55% sure of this.
Which belief is better now?
We don’t just want to be right. We want to be confident when we’re right and hesitant when we’re wrong.
Imagine that on our final exam, along with your answers, we asked you to provide a confidence score for each one. You would rate how sure you are about each answer on a percentage scale from 0% (completely unsure) to 100% (completely sure). This method assesses not only your knowledge but also your awareness of your own understanding; it could affect grading and highlight areas for improvement. Who supports this idea 😉?
Loss in machine learning
When you call fit on LogisticRegression, it has similar preferences:
correct and confident
> correct and hesitant
> incorrect and hesitant
> incorrect and confident
- These preferences are expressed through a “loss” or “error” function, like mean squared error, so lower values are better.
- When you call fit, it tries to minimize this loss.
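To see what a model's credences look like in practice, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The toy data `X` and `y` are made up for illustration. `predict_proba` returns one probability per class for each example, and `log_loss` reports the average loss on those probabilities. Note that scikit-learn adds a regularization penalty by default, so the fitted model does not minimize this number exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Hypothetical toy data: one feature, two classes (illustration only)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Each row holds [P(class 0), P(class 1)] for one example
proba = model.predict_proba(X)
preds = model.predict(X)

# Average log loss of the predicted probabilities on the training data
train_loss = log_loss(y, proba)

for x_val, p, pred in zip(X[:, 0], proba, preds):
    print(f"x={x_val:.0f}  P(0)={p[0]:.3f}  P(1)={p[1]:.3f}  predicted={pred}")
print(f"average log loss: {train_loss:.3f}")
```

Examples far from the decision boundary should get probabilities closer to 0 or 1 (more confident), while examples near the boundary should get probabilities closer to 0.5 (more hesitant).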
Logistic regression loss
- confident and correct \(\rightarrow\) smaller loss
- hesitant and correct \(\rightarrow\) a bit higher loss
- hesitant and incorrect \(\rightarrow\) even higher loss
- confident and incorrect \(\rightarrow\) high loss
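The ordering above can be checked with a small calculation. This is a sketch assuming the loss is the log loss (the negative log of the probability assigned to the true class), which is the loss logistic regression is built around. The helper `log_loss_single` and the four probabilities are illustrative choices, not values from a real model.

```python
import math

def log_loss_single(y_true, p_pos):
    """Log loss for one example; y_true is 0 or 1, p_pos is the predicted P(y=1)."""
    p_true_class = p_pos if y_true == 1 else 1.0 - p_pos
    return -math.log(p_true_class)

# The true label is 1 in every case; only the predicted P(y=1) changes.
# These probabilities are hypothetical, chosen to match the four situations.
cases = {
    "confident and correct": 0.95,
    "hesitant and correct": 0.55,
    "hesitant and incorrect": 0.45,
    "confident and incorrect": 0.05,
}

losses = {name: log_loss_single(1, p) for name, p in cases.items()}

for name, loss in losses.items():
    print(f"{name:25s} P(y=1)={cases[name]:.2f}  loss={loss:.3f}")
```

With these values the losses come out to roughly 0.05, 0.60, 0.80, and 3.00, so being confidently wrong is penalized far more heavily than being hesitantly wrong.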
Misleading visualizations
This chart is attempting to suggest a relationship between childhood MMR vaccination rates and the prevalence of autism spectrum disorders (AD/ASD) across several countries.
Do you see any problems with this visualization?
Visualizing your data and results can be very powerful, but it can also be misleading if not done properly.