
Analysis of medical reports

In this demo, we analyze medical data with data analytics tools.

Over the course of the analysis, we calculate several metrics of the consistency and accuracy of the doctors' work.

Installing the libraries

Change to the working directory by specifying the path to the project manually, then install the required dependencies.

In [ ]:
%cd /user/Demo_public/biomedical/analis_medical_result
In [ ]:
!pip install -r /user/Demo_public/biomedical/analis_medical_result/requirements.txt
In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, balanced_accuracy_score, mean_squared_error
from sklearn.metrics import precision_recall_curve, average_precision_score, f1_score, fbeta_score, roc_auc_score,ConfusionMatrixDisplay, roc_curve, auc
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import sem, t

Let's read the data and look at the first five rows of the dataframe.

In [ ]:
focus = pd.read_csv('/user/Demo_public/biomedical/analis_medical_result/Date.csv')
In [ ]:
focus.head()
Out[0]:
   File ID  Doctor No. 1  Doctor No. 2  Doctor No. 3  Doctor No. 4  Doctor No. 5  Doctor No. 6  Doctor No. 7  Doctor No. 8  Doctor No. 9  Doctor No. 10  Doctor No. 11  Doctor No. 12  Doctor No. 13  Doctor No. 14  Doctor No. 15
0        0             0             0             0             0             0             0             1             0             0              0              1              0              0              0              0
1        1             0             0             0             0             0             0             0             0             0              0              0              0              0              0              0
2        2             0             0             0             0             1             0             0             0             0              0              0              0              0              0              0
3        3             0             0             0             0             0             0             0             0             0              0              0              0              0              0              0
4        4             0             0             0             0             0             0             1             1             0              0              0              0              0              0              1
In [ ]:
focus = focus.drop('File ID', axis=1)

The file contains the image annotations. Each of the 15 doctors assessed 477 images and recorded the absence (0) or presence (1) of pathology.

From this data you can estimate, for each image, how many doctors are confident that pathology is present, and form a preliminary judgment about where pathology is really present and where it is not. You can also look for strong discrepancies between the doctors' results and determine how well the doctors agree with each other when identifying pathology.

We will evaluate the quality of each doctor's pathology annotations.

In [ ]:
focus_corr = focus.corr(method='spearman')
filtered = focus_corr.where(focus_corr > 0.5).abs()

As the correlation table shows, there are few pairs of doctors whose decisions about pathology have a Spearman correlation above 0.5. Overall, the correlation is weak.

However, two pairs of doctors (3 and 13, 12 and 10) show a moderate correlation. These pairs tend to mark most of the images in the same way, which may indicate a high level of consistency between them on whether an image contains pathology.
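
For two columns of 0/1 marks, the rank of each value is a linear function of the value itself, so Spearman's coefficient coincides with Pearson's, which for binary data is the phi coefficient of the 2x2 contingency table. The sketch below checks this equality on made-up markings; the helper names `pearson` and `phi` and the two lists are illustrative assumptions, not part of the notebook.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def phi(x, y):
    """Phi coefficient from the 2x2 contingency table of two binary sequences."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom

# Hypothetical markings of two doctors for ten images (not taken from the dataset)
doctor_a = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
doctor_b = [0, 0, 1, 0, 0, 0, 0, 1, 1, 0]

r = pearson(doctor_a, doctor_b)
p = phi(doctor_a, doctor_b)
print(round(r, 4), round(p, 4))
```

Both functions fail with a division by zero if a doctor gives the same mark to every image, because the correlation is undefined in that case.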

In [ ]:
focus_corr_ = focus.corr(method='spearman')
filtered_ = focus_corr_.where(focus_corr_ < 0.1)

There are also pairs of doctors whose correlation is below 0.1: their markings are almost unrelated, which means that they rarely agree on which images contain pathology.

In [ ]:
focus_corr_mean = focus_corr.mean(axis=1).sort_values(ascending=False)
focus_corr_mean
Out[0]:
Doctor No. 1     0.425707
Doctor No. 3     0.394424
Doctor No. 12    0.392694
Doctor No. 10    0.384297
Doctor No. 9     0.384255
Doctor No. 13    0.376615
Doctor No. 14    0.336404
Doctor No. 5     0.328782
Doctor No. 4     0.322743
Doctor No. 11    0.317238
Doctor No. 15    0.316465
Doctor No. 6     0.312953
Doctor No. 2     0.307039
Doctor No. 7     0.297679
Doctor No. 8     0.143105
dtype: float64

The table above shows the average correlation of each doctor with the other doctors (the average includes the doctor's correlation with himself, which equals 1). It indicates how closely each doctor's opinion agrees with the others.

In [ ]:
plt.figure(figsize=(8,8))
sns.heatmap(focus_corr.round(2), annot=True, center=0, cmap='inferno')
plt.title("Correlation between doctors' markings")
Out[0]:
Text(0.5, 1.0, "Correlation between doctors' markings")
In [ ]:
plt.figure(figsize=(8,8))
sns.heatmap(filtered.round(2), annot=True, center=0, cmap='inferno')
plt.title("Correlation between doctors' markings (moderate correlation only)")
Out[0]:
Text(0.5, 1.0, "Correlation between doctors' markings (moderate correlation only)")

Let's see, for each image, what percentage of doctors are confident that pathology is present.

In [ ]:
confidence = focus.sum(axis=1).sort_values(ascending=False) / focus.shape[1] * 100
confidence.head(20)
Out[0]:
433    100.000000
394     86.666667
28      80.000000
346     80.000000
352     80.000000
359     73.333333
182     73.333333
134     73.333333
113     73.333333
383     73.333333
398     73.333333
153     73.333333
295     66.666667
120     66.666667
342     60.000000
103     60.000000
341     53.333333
387     53.333333
339     53.333333
192     53.333333
dtype: float64

The table above shows, for each image ID, the estimated probability of pathology as a percentage, based on the doctors' decisions.

In [ ]:
average_marking = focus.mean(axis=1)
In [ ]:
average_doctor_deviation = focus.sub(average_marking, axis=0).abs().mean().sort_values(ascending=False)
average_doctor_deviation
Out[0]:
Doctor No. 15    0.245702
Doctor No. 2     0.211880
Doctor No. 6     0.168553
Doctor No. 7     0.153878
Doctor No. 8     0.145073
Doctor No. 4     0.144095
Doctor No. 12    0.140741
Doctor No. 14    0.124948
Doctor No. 11    0.122991
Doctor No. 10    0.121454
Doctor No. 1     0.117540
Doctor No. 5     0.116841
Doctor No. 9     0.116143
Doctor No. 13    0.110552
Doctor No. 3     0.106918
dtype: float64

Doctor No. 15 has the highest average deviation (0.2457), which means that this doctor's markings differ significantly from the average markings of all doctors. This may indicate errors in the markup.

Doctor No. 3 has the lowest average deviation (0.1069), which indicates that this doctor's markings are highly consistent with the average markings of all doctors. This doctor most often marked the images in the same way as most other doctors.

Thus, we obtained each doctor's average deviation from the average marking of all doctors.

Next, we calculate the covariance matrix of the doctors' assessments.

In [ ]:
covariance_matrix = focus.cov()
covariance_matrix
Out[0]:
                Doctor No. 1   Doctor No. 2   Doctor No. 3   Doctor No. 4   Doctor No. 5   Doctor No. 6   Doctor No. 7   Doctor No. 8   Doctor No. 9  Doctor No. 10  Doctor No. 11  Doctor No. 12  Doctor No. 13  Doctor No. 14  Doctor No. 15
Doctor No. 1        0.094031       0.046544       0.032090       0.029244       0.031869       0.054032       0.030905       0.004118       0.033649       0.046152       0.026346       0.047610       0.033750       0.041832       0.048782
Doctor No. 2        0.046544       0.181157       0.022775       0.038361       0.020176       0.038414       0.016357       0.008676       0.025206       0.047540       0.023492       0.061801       0.028082       0.033111       0.060079
Doctor No. 3        0.032090       0.022775       0.051645       0.022770       0.020017       0.022466       0.026743       0.007183       0.029262       0.021814       0.021431       0.027699       0.030407       0.027047       0.027972
Doctor No. 4        0.029244       0.038361       0.022770       0.107033       0.018313       0.038643       0.020387       0.011500       0.030209       0.040260       0.016780       0.049346       0.020158       0.015248       0.039612
Doctor No. 5        0.031869       0.020176       0.020017       0.018313       0.053512       0.017917       0.020176       0.004955       0.026994       0.023704       0.012883       0.021088       0.019780       0.020572       0.016846
Doctor No. 6        0.054032       0.038414       0.022466       0.038643       0.017917       0.138479       0.016939      -0.003788       0.030896       0.040022       0.020031       0.047073       0.023871       0.026346       0.055983
Doctor No. 7        0.030905       0.016357       0.026743       0.020387       0.020176       0.016939       0.110195       0.015446       0.025673       0.023030       0.033296       0.034023       0.017811       0.023307       0.034168
Doctor No. 8        0.004118       0.008676       0.007183       0.011500       0.004955      -0.003788       0.015446       0.057220       0.003550       0.000172       0.008390       0.009967       0.004827       0.001321       0.007201
Doctor No. 9        0.033649       0.025206       0.029262       0.030209       0.026994       0.030896       0.025673       0.003550       0.073472       0.036084       0.021788       0.038705       0.026826       0.027086       0.037326
Doctor No. 10       0.046152       0.047540       0.021814       0.040260       0.023704       0.040022       0.023030       0.000172       0.036084       0.090693       0.016133       0.058731       0.023492       0.025369       0.054225
Doctor No. 11       0.026346       0.023492       0.021431       0.016780       0.012883       0.020031       0.033296       0.008390       0.021788       0.016133       0.064531       0.025540       0.023241       0.019542       0.021524
Doctor No. 12       0.047610       0.061801       0.027699       0.049346       0.021088       0.047073       0.034023       0.009967       0.038705       0.058731       0.025540       0.125478       0.029183       0.023691       0.067874
Doctor No. 13       0.033750       0.028082       0.030407       0.020158       0.019780       0.023871       0.017811       0.004827       0.026826       0.023492       0.023241       0.029183       0.055371       0.024602       0.026730
Doctor No. 14       0.041832       0.033111       0.027047       0.015248       0.020572       0.026346       0.023307       0.001321       0.027086       0.025369       0.019542       0.023691       0.024602       0.075234       0.028302
Doctor No. 15       0.048782       0.060079       0.027972       0.039612       0.016846       0.055983       0.034168       0.007201       0.037326       0.054225       0.021524       0.067874       0.026730       0.028302       0.208657
In [ ]:
covariance_matrix_mean = covariance_matrix.mean(axis=1).sort_values(ascending=False)
covariance_matrix_mean
Out[0]:
Doctor No. 15    0.049019
Doctor No. 12    0.044520
Doctor No. 2     0.043451
Doctor No. 1     0.040064
Doctor No. 6     0.037822
Doctor No. 10    0.036495
Doctor No. 4     0.033191
Doctor No. 9     0.031115
Doctor No. 7     0.029897
Doctor No. 14    0.027507
Doctor No. 3     0.026088
Doctor No. 13    0.025875
Doctor No. 11    0.023663
Doctor No. 5     0.021920
Doctor No. 8     0.009383
dtype: float64

Doctor No. 15 has the highest average covariance (0.0490). This suggests that the markup of Doctor No. 15 varies, on average, together with the markup of the other doctors: this doctor often notes the presence or absence of pathology in the same way as the others.

Doctor No. 8 has the lowest average covariance (0.0094). This indicates that the markup of Doctor No. 8 is, on average, less consistent with the markup of the other doctors: this doctor most often evaluates the presence or absence of pathology differently from the others.

Doctor No. 15 thus has both the highest average covariance and the highest average deviation. Covariance is not normalized, so it also grows with the doctor's own variance (0.2087 for Doctor No. 15, the largest value on the diagonal). A possible explanation is that this doctor's opinion about pathology generally agrees with the majority, but he notices hidden features more often than others and somewhat overestimates them.

In [ ]:
doctor_13 = focus['Doctor No. 13']
doctor_3 = focus['Doctor No. 3']
doctor_12 = focus['Doctor No. 12']
doctor_10 = focus['Doctor No. 10']
In [ ]:
cohen_kappa_score(doctor_13, doctor_3), cohen_kappa_score(doctor_12, doctor_10)
Out[0]:
(np.float64(0.5681836885853017), np.float64(0.5380704515191865))

This confirms what the correlation table showed: these two pairs are the most consistent in their assessments of the images. Kappa values of about 0.57 and 0.54 correspond to moderate agreement.
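
Cohen's kappa compares the observed share of matching marks with the share expected by chance from each doctor's own frequency of 0s and 1s. The sketch below computes it by hand in plain Python; the function name `cohen_kappa_manual` and the two lists of marks are illustrative assumptions, not data from the notebook.

```python
def cohen_kappa_manual(a, b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    labels = sorted(set(a) | set(b))
    # Observed agreement: share of images where both raters gave the same mark
    p_o = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of the raters' marginal frequencies, summed over labels
    p_e = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    # Undefined (division by zero) when p_e == 1, i.e. both raters use a single label
    return (p_o - p_e) / (1 - p_e)

# Hypothetical marks of two doctors for ten images
rater_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
rater_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

kappa = cohen_kappa_manual(rater_a, rater_b)
print(round(kappa, 4))
```

Here the raters agree on 9 of 10 images, yet the chance-expected agreement is already 0.74 because negatives dominate, so kappa is much lower than the raw agreement.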

In [ ]:
counts = np.zeros((focus.shape[0], 2), dtype=int)
for idx, row in focus.iterrows():
  counts[idx, 0] = (row == 0).sum()
  counts[idx, 1] = (row == 1).sum()
In [ ]:
fleiss_kappa(counts)
Out[0]:
np.float64(0.2591132582884047)

Fleiss' kappa measures the agreement among all the doctors at once. The result (about 0.26) shows that agreement is low: the doctors disagree about the presence of pathology.
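
The computation behind Fleiss' kappa can be sketched in plain Python from the same kind of table as `counts`: one row per image, one column per category, each cell holding the number of raters who chose that category. The function name `fleiss_kappa_manual` and the small table are illustrative assumptions.

```python
def fleiss_kappa_manual(table):
    """table[i][j] = number of raters who assigned image i to category j."""
    n_items = len(table)
    n_raters = sum(table[0])          # assumes every image was rated by the same number of raters
    n_cats = len(table[0])
    # Share of all assignments that fall into each category
    p_j = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cats)]
    # Per-image agreement: fraction of rater pairs that agree on the image
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in table]
    P_bar = sum(P_i) / n_items        # mean observed agreement
    P_e = sum(p * p for p in p_j)     # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical table: 4 images, 5 raters, columns = [votes for 0, votes for 1]
toy_counts = [[5, 0], [4, 1], [1, 4], [5, 0]]
toy_kappa = fleiss_kappa_manual(toy_counts)
print(round(toy_kappa, 4))
```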

In [ ]:
kappa_scores = pd.DataFrame(index=focus.columns, columns=focus.columns, dtype=float)
In [ ]:
# Pairwise Cohen's kappa for every pair of doctors (the diagonal equals 1)
for col in focus.columns:
  for col_ in focus.columns:
    kappa_scores.loc[col, col_] = cohen_kappa_score(focus[col], focus[col_])

This matrix shows which doctors agree with each other more.

In [ ]:
kappa_scores_mean = kappa_scores.mean(1).sort_values(ascending=False)
kappa_scores_mean
Out[0]:
Doctor No. 1     0.411056
Doctor No. 12    0.374667
Doctor No. 10    0.370726
Doctor No. 9     0.370257
Doctor No. 3     0.369324
Doctor No. 13    0.355202
Doctor No. 14    0.324713
Doctor No. 4     0.311602
Doctor No. 5     0.310791
Doctor No. 11    0.304169
Doctor No. 6     0.296735
Doctor No. 7     0.286314
Doctor No. 2     0.275085
Doctor No. 15    0.270183
Doctor No. 8     0.137958
dtype: float64

The table shows that Doctor No. 15 has one of the lowest average agreement scores among the doctors.

Cohen's kappa, unlike correlation, corrects for chance agreement.

This may be because the doctor often agrees with the other doctors but makes more positive or negative diagnoses than they do.

Let's calculate each doctor's mean squared deviation from the average marking of all doctors.

In [ ]:
standard_deviation_mean=focus.sub(focus.mean(axis=1), axis=0).pow(2).mean().sort_values(ascending=False)
standard_deviation_mean
Out[0]:
Doctor No. 15    0.174125
Doctor No. 2     0.140303
Doctor No. 6     0.096976
Doctor No. 7     0.082301
Doctor No. 8     0.073496
Doctor No. 4     0.072518
Doctor No. 12    0.069164
Doctor No. 14    0.053371
Doctor No. 11    0.051414
Doctor No. 10    0.049877
Doctor No. 1     0.045963
Doctor No. 5     0.045264
Doctor No. 9     0.044566
Doctor No. 13    0.038975
Doctor No. 3     0.035341
dtype: float64

Here the mean squared error (MSE) relative to the doctors' average shows which doctors deviate the most, that is, who most often gives assessments of a particular image that are inconsistent with the other doctors.

In [ ]:
std_mean=focus.std().sort_values(ascending=False)
std_mean
Out[0]:
Doctor No. 15    0.456790
Doctor No. 2     0.425625
Doctor No. 6     0.372128
Doctor No. 12    0.354229
Doctor No. 7     0.331956
Doctor No. 4     0.327159
Doctor No. 1     0.306645
Doctor No. 10    0.301153
Doctor No. 14    0.274288
Doctor No. 9     0.271057
Doctor No. 11    0.254030
Doctor No. 8     0.239208
Doctor No. 13    0.235310
Doctor No. 5     0.231327
Doctor No. 3     0.227254
dtype: float64

The standard deviation (SD) shows which doctors' markings vary the most.

Next, we obtain a "reference label" for each image from the majority vote: if most of the doctors report pathology, we set 1; otherwise, 0.

In [ ]:
diagnosis_major = pd.Series([0 if x > y else 1 for x, y in zip(counts[:, 0], counts[:, 1])])
In [ ]:
accuracy_df = focus.apply(lambda x: accuracy_score(diagnosis_major, x)).sort_values(ascending=False)
accuracy_df
Out[0]:
Doctor No. 3     0.955975
Doctor No. 5     0.949686
Doctor No. 13    0.947589
Doctor No. 9     0.943396
Doctor No. 1     0.943396
Doctor No. 14    0.932914
Doctor No. 11    0.928721
Doctor No. 10    0.926625
Doctor No. 12    0.893082
Doctor No. 4     0.888889
Doctor No. 7     0.884696
Doctor No. 8     0.882600
Doctor No. 6     0.870021
Doctor No. 2     0.807128
Doctor No. 15    0.761006
dtype: float64

The table above shows the accuracy of each doctor's assessments compared with the reference labels. The reference labels follow the majority principle: the more doctors confirm a diagnosis for a particular image, the higher the probability that the image really shows pathology.
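
The majority-vote reference and the per-doctor accuracy can be sketched in plain Python on a tiny made-up example. The helper names `majority_label` and `accuracy` and the `marks` matrix are illustrative assumptions; with an odd number of doctors, as in the dataset, ties cannot occur.

```python
def majority_label(row):
    """Return 1 if more than half of the doctors marked pathology, else 0."""
    return 1 if sum(row) > len(row) / 2 else 0

def accuracy(reference, predicted):
    """Share of images where the doctor's mark equals the reference label."""
    return sum(1 for r, p in zip(reference, predicted) if r == p) / len(reference)

# Hypothetical markings: 4 images (rows) x 5 doctors (columns)
marks = [
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
]

reference = [majority_label(row) for row in marks]
per_doctor = [accuracy(reference, [row[d] for row in marks]) for d in range(5)]
print(reference, per_doctor)
```

Note that every doctor contributes to the reference they are then scored against, so this accuracy is somewhat optimistic for doctors who usually vote with the majority.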

In [ ]:
recall_df = focus.apply(lambda x: recall_score(diagnosis_major, x)).sort_values(ascending=False)
recall_df
Out[0]:
Doctor No. 15    0.965517
Doctor No. 1     0.896552
Doctor No. 2     0.862069
Doctor No. 12    0.827586
Doctor No. 6     0.793103
Doctor No. 10    0.724138
Doctor No. 9     0.689655
Doctor No. 14    0.620690
Doctor No. 7     0.586207
Doctor No. 4     0.586207
Doctor No. 3     0.586207
Doctor No. 5     0.551724
Doctor No. 13    0.551724
Doctor No. 11    0.482759
Doctor No. 8     0.034483
dtype: float64

Recall can be used to assess how often the doctor correctly identifies the presence of pathologies in the images.

In [ ]:
columns = focus.columns
confusion_matrices = {}
for column in columns:
  confusion_matrix_ = confusion_matrix(diagnosis_major, focus[column])
  confusion_matrices[column] = confusion_matrix_.sum() - np.diag(confusion_matrix_).sum()
In [ ]:
errors_df = pd.DataFrame(list(confusion_matrices.items()), columns=['Doctor', 'Number of errors']).sort_values(ascending=False, by='Number of errors')
errors_df
Out[0]:
           Doctor  Number of errors
14  Doctor No. 15               114
1    Doctor No. 2                92
5    Doctor No. 6                62
7    Doctor No. 8                56
6    Doctor No. 7                55
3    Doctor No. 4                53
11  Doctor No. 12                51
9   Doctor No. 10                35
10  Doctor No. 11                34
13  Doctor No. 14                32
0    Doctor No. 1                27
8    Doctor No. 9                27
12  Doctor No. 13                25
4    Doctor No. 5                24
2    Doctor No. 3                21

By counting FP + FN, we can see which of the doctors has the most errors.

In [ ]:
counts[:, 0].sum(), counts[:, 1].sum()
Out[0]:
(np.int64(6316), np.int64(839))

(6316, 839) is the total number of the doctors' diagnoses: 6316 votes for the absence of pathology and 839 for its presence. As you can see, the classes are imbalanced, so let's apply balanced_accuracy_score from sklearn.

In [ ]:
balanced_df = focus.apply(lambda x: balanced_accuracy_score(diagnosis_major, x)).sort_values(ascending=False)
balanced_df
Out[0]:
Doctor No. 1     0.921490
Doctor No. 12    0.862454
Doctor No. 15    0.856643
Doctor No. 6     0.834052
Doctor No. 2     0.832820
Doctor No. 10    0.831935
Doctor No. 9     0.824738
Doctor No. 14    0.786907
Doctor No. 3     0.783059
Doctor No. 5     0.763585
Doctor No. 13    0.762469
Doctor No. 4     0.747345
Doctor No. 7     0.745112
Doctor No. 11    0.720174
Doctor No. 8     0.485991
dtype: float64
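
Balanced accuracy is the average of the recall obtained on each class, which is why it penalizes a rater who almost never reports pathology even when plain accuracy looks high (compare Doctor No. 8 in the tables above). A minimal sketch in plain Python on made-up labels; the helper names and the data are illustrative assumptions.

```python
def recall_for(label, reference, predicted):
    """Recall for one class: share of reference items of that class that were predicted correctly."""
    idx = [i for i, r in enumerate(reference) if r == label]
    return sum(1 for i in idx if predicted[i] == label) / len(idx)

def balanced_accuracy(reference, predicted):
    """Mean of per-class recall over the classes present in the reference."""
    labels = sorted(set(reference))
    return sum(recall_for(label, reference, predicted) for label in labels) / len(labels)

# Hypothetical imbalanced reference: 18 images without pathology, 2 with pathology
toy_reference = [0] * 18 + [1] * 2
# A rater who never reports pathology
always_negative = [0] * 20

plain = sum(r == p for r, p in zip(toy_reference, always_negative)) / len(toy_reference)
balanced = balanced_accuracy(toy_reference, always_negative)
print(plain, balanced)
```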

Let's build graphs that visually display information about which doctor coped best with the diagnosis. In total, 10 metrics were calculated, and we will build 10 graphs.

In [ ]:
colors = ['blue', 'green', 'red', 'purple', 'orange', 'yellow', 'brown', 'pink', 'gray', 'cyan', 'magenta', 'lime', 'teal', 'navy']

fig, axes = plt.subplots(nrows=5, ncols=2,figsize=(13, 24))
axes = axes.flatten()
axes[0].bar(focus_corr_mean.index.tolist(), focus_corr_mean.values.tolist(), color=colors[0])
axes[0].set_title('Average correlation by doctors')
axes[0].set_xlabel('Doctor')
axes[0].set_ylabel('Average correlation')
axes[0].tick_params(axis='x', rotation=45)
axes[1].bar(average_doctor_deviation.index.tolist(), average_doctor_deviation.values.tolist(), color=colors[1])
axes[1].set_title('Average deviation by doctors')
axes[1].set_xlabel('Doctor')
axes[1].set_ylabel('Average deviation')
axes[1].tick_params(axis='x', rotation=45)
axes[2].bar(covariance_matrix_mean.index.tolist(), covariance_matrix_mean.values.tolist(), color=colors[2])
axes[2].set_title('Average covariance by doctors')
axes[2].set_xlabel('Doctor')
axes[2].set_ylabel('Average covariance')
axes[2].tick_params(axis='x', rotation=45)
axes[3].bar(kappa_scores_mean.index.tolist(), kappa_scores_mean.values.tolist(), color=colors[3])
axes[3].set_title("Average Cohen's kappa by doctors")
axes[3].set_xlabel('Doctor')
axes[3].set_ylabel("Average Cohen's kappa")
axes[3].tick_params(axis='x', rotation=45)
axes[4].bar(standard_deviation_mean.index.tolist(), standard_deviation_mean.values.tolist(), color=colors[4])
axes[4].set_title('Mean squared deviation by doctors')
axes[4].set_xlabel('Doctor')
axes[4].set_ylabel('Mean squared deviation')
axes[4].tick_params(axis='x', rotation=45)
axes[5].bar(std_mean.index.tolist(), std_mean.values.tolist(), color=colors[5])
axes[5].set_title('Standard deviation by doctors')
axes[5].set_xlabel('Doctor')
axes[5].set_ylabel('Standard deviation')
axes[5].tick_params(axis='x', rotation=45)
axes[6].bar(accuracy_df.index.tolist(), accuracy_df.values.tolist(), color=colors[6])
axes[6].set_title('Majority accuracy by doctors')
axes[6].set_xlabel('Doctor')
axes[6].set_ylabel('Majority accuracy')
axes[6].tick_params(axis='x', rotation=45)
axes[7].bar(recall_df.index.tolist(), recall_df.values.tolist(), color=colors[7])
axes[7].set_title('Majority recall by doctors')
axes[7].set_xlabel('Doctor')
axes[7].set_ylabel('Majority recall')
axes[7].tick_params(axis='x', rotation=45)
axes[8].bar(errors_df['Doctor'], errors_df['Number of errors'], color=colors[8])
axes[8].set_title('Errors by doctors')
axes[8].set_xlabel('Doctor')
axes[8].set_ylabel('Number of errors')
axes[8].tick_params(axis='x', rotation=45)
axes[9].bar(balanced_df.index.tolist(), balanced_df.values.tolist(), color=colors[9])
axes[9].set_title('Majority balanced accuracy by doctors')
axes[9].set_xlabel('Doctor')
axes[9].set_ylabel('Majority balanced accuracy')
axes[9].tick_params(axis='x', rotation=45)
plt.subplots_adjust(hspace=0.6)
plt.tight_layout()

Let's analyze the graphs. Doctors with a high average correlation most often have the same opinion about an image as the other doctors. Covariance is not normalized, so a high average covariance also reflects the variability of the doctor's own markings, not only agreement with colleagues. The Cohen's kappa score shows how well a doctor's decisions align with those of the other doctors. Unlike the average correlation, this score takes chance agreement into account.

The average deviation and the mean squared deviation show how much the opinion of a particular doctor deviates from the average opinion of the doctors, while the standard deviation (SD) expresses numerically how much the doctor's own markings vary.

The majority label of each image was calculated, that is, pathology was assessed based on the opinion of the majority. We treat this as the reference. Accuracy shows how well a doctor makes the correct diagnosis (both absence and presence of pathology), while recall shows how well a doctor finds pathology in the cases where it is really present. Balanced accuracy was calculated because the doctors reported the absence of pathology several times more often than its presence, that is, the classes are imbalanced.

In [ ]:
fig, axes = plt.subplots(figsize=(7, 7))
axes.bar(accuracy_df.index.tolist(), accuracy_df.values.tolist(), color=colors[6], alpha=1, label='Majority accuracy')
axes.set_title('Majority accuracy and balanced accuracy by doctors')
axes.set_xlabel('Doctor')
axes.set_ylabel('Majority accuracy, balanced accuracy')
axes.tick_params(axis='x', rotation=45)
axes.bar(balanced_df.index.tolist(), balanced_df.values.tolist(), color=colors[9], alpha=0.5, label='Majority balanced accuracy')
axes.legend()
axes.set_ylim(0, 1.25)
plt.tight_layout()
In [ ]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=confidence, orient='v', color='lightblue')
plt.ylabel('The probability of pathology (%)')
plt.title('Probability distribution of pathology')
plt.tight_layout()
plt.show()

As the graph above shows, most of the images do not have a 100 percent consensus that pathology is present. Only for one image did all the doctors report pathology.

Let's try to create a single metric that combines all of the above.

Let's identify the doctors with the weakest markup.

In [ ]:
metrics_df = pd.concat([focus_corr_mean, average_doctor_deviation, covariance_matrix_mean, kappa_scores_mean, standard_deviation_mean, std_mean, accuracy_df, recall_df, balanced_df, errors_df.set_index('Doctor')['Number of errors']], axis=1)
metrics_df.columns=['focus_corr_mean', 'average_doctor_deviation', 'covariance_matrix_mean', 'kappa_scores_mean', 'standard_deviation_mean', 'std_mean', 'accuracy_df', 'recall_df', 'balanced_df', 'errors_df']
In [ ]:
metrics_df
Out[0]:
               focus_corr_mean  average_doctor_deviation  covariance_matrix_mean  kappa_scores_mean  standard_deviation_mean  std_mean  accuracy_df  recall_df  balanced_df  errors_df
Doctor No. 1          0.425707                  0.117540                0.040064           0.411056                 0.045963  0.306645     0.943396   0.896552     0.921490         27
Doctor No. 3          0.394424                  0.106918                0.026088           0.369324                 0.035341  0.227254     0.955975   0.586207     0.783059         21
Doctor No. 12         0.392694                  0.140741                0.044520           0.374667                 0.069164  0.354229     0.893082   0.827586     0.862454         51
Doctor No. 10         0.384297                  0.121454                0.036495           0.370726                 0.049877  0.301153     0.926625   0.724138     0.831935         35
Doctor No. 9          0.384255                  0.116143                0.031115           0.370257                 0.044566  0.271057     0.943396   0.689655     0.824738         27
Doctor No. 13         0.376615                  0.110552                0.025875           0.355202                 0.038975  0.235310     0.947589   0.551724     0.762469         25
Doctor No. 14         0.336404                  0.124948                0.027507           0.324713                 0.053371  0.274288     0.932914   0.620690     0.786907         32
Doctor No. 5          0.328782                  0.116841                0.021920           0.310791                 0.045264  0.231327     0.949686   0.551724     0.763585         24
Doctor No. 4          0.322743                  0.144095                0.033191           0.311602                 0.072518  0.327159     0.888889   0.586207     0.747345         53
Doctor No. 11         0.317238                  0.122991                0.023663           0.304169                 0.051414  0.254030     0.928721   0.482759     0.720174         34
Doctor No. 15         0.316465                  0.245702                0.049019           0.270183                 0.174125  0.456790     0.761006   0.965517     0.856643        114
Doctor No. 6          0.312953                  0.168553                0.037822           0.296735                 0.096976  0.372128     0.870021   0.793103     0.834052         62
Doctor No. 2          0.307039                  0.211880                0.043451           0.275085                 0.140303  0.425625     0.807128   0.862069     0.832820         92
Doctor No. 7          0.297679                  0.153878                0.029897           0.286314                 0.082301  0.331956     0.884696   0.586207     0.745112         55
Doctor No. 8          0.143105                  0.145073                0.009383           0.137958                 0.073496  0.239208     0.882600   0.034483     0.485991         56

Let's normalize the metrics to a common scale.

In [ ]:
scaler = MinMaxScaler()
normalized_df = pd.DataFrame(scaler.fit_transform(metrics_df), index=metrics_df.index, columns=metrics_df.columns)
In [ ]:
normalized_df
Out[0]:
               focus_corr_mean  average_doctor_deviation  covariance_matrix_mean  kappa_scores_mean  standard_deviation_mean  std_mean  accuracy_df  recall_df  balanced_df  errors_df
Doctor No. 1          1.000000                  0.076536                0.774068           1.000000                 0.076536  0.345876     0.935484   0.925926     1.000000   0.064516
Doctor No. 3          0.889305                  0.000000                0.421469           0.847191                 0.000000  0.000000     1.000000   0.592593     0.682131   0.000000
Doctor No. 12         0.883183                  0.243706                0.886512           0.866755                 0.243706  0.553179     0.677419   0.851852     0.864440   0.322581
Doctor No. 10         0.853470                  0.104733                0.684026           0.852324                 0.104733  0.321947     0.849462   0.740741     0.794362   0.150538
Doctor No. 9          0.853321                  0.066465                0.548299           0.850607                 0.066465  0.190834     0.935484   0.703704     0.777837   0.064516
Doctor No. 13         0.826287                  0.026183                0.416106           0.795481                 0.026183  0.035093     0.956989   0.555556     0.634853   0.043011
Doctor No. 14         0.683997                  0.129909                0.457279           0.683842                 0.129909  0.204907     0.881720   0.629630     0.690969   0.118280
Doctor No. 5          0.657025                  0.071501                0.316315           0.632860                 0.071501  0.017741     0.967742   0.555556     0.637416   0.032258
Doctor No. 4          0.635656                  0.267875                0.600673           0.635830                 0.267875  0.435245     0.655914   0.592593     0.600124   0.344086
Doctor No. 11         0.616178                  0.115811                0.360295           0.608615                 0.115811  0.116653     0.860215   0.481481     0.537734   0.139785
Doctor No. 15         0.613442                  1.000000                1.000000           0.484167                 1.000000  1.000000     0.000000   1.000000     0.851096   1.000000
Doctor No. 6          0.601015                  0.444109                0.717502           0.581393                 0.444109  0.631160     0.559140   0.814815     0.799222   0.440860
Doctor No. 2          0.580087                  0.756294                0.859540           0.502117                 0.756294  0.864227     0.236559   0.888889     0.796394   0.763441
Doctor No. 7          0.546967                  0.338369                0.517571           0.543235                 0.338369  0.456147     0.634409   0.592593     0.594998   0.365591
Doctor No. 8          0.000000                  0.274924                0.000000           0.000000                 0.274924  0.052077     0.623656   0.000000     0.000000   0.376344
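
Min-max scaling maps each metric to the range [0, 1] with (x - min) / (max - min), computed per column. The plain-Python sketch below applies it to four of the error counts from the metrics table (Doctors No. 1, 3, 12 and 15, which include the column's minimum 21 and maximum 114); the function name `min_max` is an illustrative assumption.

```python
def min_max(values):
    """Scale values to [0, 1]; the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Error counts of Doctors No. 1, 3, 12 and 15 from the metrics table
errors = [27, 21, 51, 114]
scaled = min_max(errors)
print([round(v, 6) for v in scaled])
```

The results for these four doctors agree with the errors_df column of the normalized table. Scaling does not change the direction of a metric: for error-like metrics a value of 1 is still the worst, which is why they receive negative weights later.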

Let's set a weighting factor for each of the metrics

In [ ]:
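# Keys must match the column names of normalized_df to take effect.
# Note: 'balanced_accuracy', 'recall' and 'accuracy' do not match
# 'balanced_df', 'recall_df' and 'accuracy_df', so pandas aligns them to
# all-NaN columns and they do not contribute to the scores shown below.
weights = {
    'kappa_scores_mean': 0.3,
    'covariance_matrix_mean': -0.2,
    'average_doctor_deviation': -0.25,
    'focus_corr_mean': 0.25,
    'balanced_accuracy': 0.25,
    'recall': 0.25,
    'accuracy': 0.25,
    'std_mean': -0.15,
    'standard_deviation_mean': -0.10,
    'errors_df': -0.2
}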

Let's calculate the combined metric as the sum of the normalized metrics multiplied by their weighting coefficients.

In [ ]:
normalized_df['final_score'] = (normalized_df * pd.Series(weights)).sum(axis=1)
In [ ]:
normalized_df['final_score'].sort_values(ascending=False)
Out[0]:
Doctor No. 3     0.392190
Doctor No. 13    0.338965
Doctor No. 1     0.303614
Doctor No. 9     0.294062
Doctor No. 5     0.256713
Doctor No. 10    0.217204
Doctor No. 14    0.184836
Doctor No. 11    0.178581
Doctor No. 12    0.070730
Doctor No. 4     0.001668
Doctor No. 7    -0.063771
Doctor No. 6    -0.157113
Doctor No. 8    -0.179304
Doctor No. 2    -0.423276
Doctor No. 15   -0.601389
Name: final_score, dtype: float64

Thus, based on the existing metrics, we defined our own combined metric, calculated it, and obtained a "rating" of the doctors.
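
The weighted sum can be checked by hand for one doctor. The plain-Python sketch below uses the normalized metrics of Doctor No. 3 from the table above and the weights from the notebook. The helper `weighted_score` is a hypothetical illustration: it sums only the weights whose keys are present among the metrics and reports the rest, which mirrors how pandas label alignment followed by `sum` (NaN values skipped) behaves.

```python
def weighted_score(metrics, weights):
    """Weighted sum over keys present in both mappings; also returns the unmatched weight keys."""
    matched = {k: w for k, w in weights.items() if k in metrics}
    unmatched = sorted(set(weights) - set(metrics))
    score = sum(metrics[k] * w for k, w in matched.items())
    return score, unmatched

# Normalized metrics of Doctor No. 3, copied from the normalized table
doctor_3 = {
    'focus_corr_mean': 0.889305, 'average_doctor_deviation': 0.0,
    'covariance_matrix_mean': 0.421469, 'kappa_scores_mean': 0.847191,
    'standard_deviation_mean': 0.0, 'std_mean': 0.0,
    'accuracy_df': 1.0, 'recall_df': 0.592593, 'balanced_df': 0.682131,
    'errors_df': 0.0,
}
# Weights exactly as defined in the notebook
weights = {
    'kappa_scores_mean': 0.3, 'covariance_matrix_mean': -0.2,
    'average_doctor_deviation': -0.25, 'focus_corr_mean': 0.25,
    'balanced_accuracy': 0.25, 'recall': 0.25, 'accuracy': 0.25,
    'std_mean': -0.15, 'standard_deviation_mean': -0.10, 'errors_df': -0.2,
}

score, unmatched = weighted_score(doctor_3, weights)
print(round(score, 5), unmatched)
```

The score reproduces the final_score of Doctor No. 3 (0.392190) only with the three unmatched keys left out, so accuracy, recall and balanced accuracy do not influence the rating as the weights are currently named.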

In [ ]:
plt.figure(figsize=(10, 6))
ranking = normalized_df['final_score'].sort_values(ascending=False)
# Assign the y variable to hue so that the palette is applied without a deprecation warning
sns.barplot(x=ranking.values, y=ranking.index, hue=ranking.index, palette='coolwarm', legend=False)
plt.title('Rating of doctors')
plt.xlabel('Rating')
plt.ylabel('Doctor')
plt.show()

As the graph above shows, the lowest-rated doctors are No. 15, 2, 8, 6 and 7.

Let's form the final markup and calculate the doctors' confidence.

One option is to rely on the overall vote of the doctors: for example, if 14 out of 15 doctors put 1, then pathology is present. In addition, the rating of each doctor can be taken into account.

Let's calculate the final diagnoses.

In [ ]:
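# Number of positive votes per image
votes = focus.sum(axis=1)
# Votes weighted by each doctor's rating (the Series index aligns with the columns of focus)
f = focus * normalized_df['final_score']
# Rating-weighted vote: 1 if the summed ratings of the doctors who reported pathology exceed 0.4
final_votes = (f.sum(axis=1) > 0.4).astype(int)
weights_ = {
    'doctor' : 0.4,
    'confidence': 0.6
}

# Weighted sum of the share of positive votes and the rating-weighted vote
Diagnosis_weights = (votes / focus.shape[1]) * weights_['confidence'] + final_votes * weights_['doctor']
Diagnosis = (Diagnosis_weights >= 0.36).astype(int)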
In [ ]:
Diagnosis.sum()
Out[0]:
np.int64(25)

Let's calculate the doctors' confidence

In [ ]:
normalized_diagnosis_weights = (Diagnosis_weights - np.min(Diagnosis_weights)) / (np.max(Diagnosis_weights) - np.min(Diagnosis_weights))
confidence = normalized_diagnosis_weights * 100
confidence.sort_values(ascending=False)
Out[0]:
433    100.0
394     92.0
28      88.0
346     88.0
352     88.0
       ...  
472      0.0
3        0.0
474      0.0
475      0.0
1        0.0
Length: 477, dtype: float64

Let's calculate the confidence interval

In [ ]:
mean = np.mean(Diagnosis_weights)
std_err = sem(Diagnosis_weights)

confidence_level = 0.95
n = len(Diagnosis_weights)
h = std_err * t.ppf((1 + confidence_level) / 2, n - 1)

lower_bound = mean - h
upper_bound = mean + h

# Confidence interval output
print(f"Confidence interval: [{lower_bound}, {upper_bound}]")
Confidence interval: [0.07206422989620075, 0.1021915353029607]
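
The same interval for the mean can be approximated with the standard library alone. This sketch replaces the Student t quantile used above with the normal quantile, which is an assumption that is reasonable only for large samples such as the 477 images here; the function name `mean_confidence_interval` and the sample values are illustrative.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(values, confidence_level=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    n = len(values)
    m = mean(values)
    # Standard error with the sample standard deviation (ddof=1), as scipy.stats.sem does by default
    std_err = stdev(values) / sqrt(n)
    # Normal quantile instead of the t quantile: an approximation for large n
    z = NormalDist().inv_cdf((1 + confidence_level) / 2)
    h = std_err * z
    return m - h, m + h

# Hypothetical diagnosis weights for a handful of images
sample = [0.0, 0.04, 0.08, 0.92, 0.0, 0.04, 0.0, 0.12]
lower, upper = mean_confidence_interval(sample)
print(lower, upper)
```

For a sample this small the t quantile would give a noticeably wider interval, so the sketch illustrates the formula rather than a recommended choice for few observations.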

Let's analyze the received data.

The pd.Series Diagnosis contains the final label for each image. The diagnosis was calculated from the doctors' ratings together with the probability of pathology. This probability was calculated for each image as the average of the doctors' marks for that image.

The final weight of pathology in an image (Diagnosis_weights) was calculated as a weighted sum of these two components.
The rating-based vote has a weight of 0.4, and the probability based on the average has a weight of 0.6. These weights were chosen to limit the influence of a single doctor: if 14 out of 15 doctors report no pathology and the one doctor with the highest rating (about 0.4) reports pathology, a larger rating weight would push the weighted sum up and produce a false positive.

normalized_diagnosis_weights is the confidence that an image contains pathology, obtained by min-max normalization of the weighted sum. Based on this confidence, the final label for the image was selected.

The rating-weighted vote uses a threshold of 0.4, and the final weight uses a threshold of 0.36: an image whose weight reaches 0.36 is labeled as pathology. The result: 25 images with pathology were identified.
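
The weighted sum for a single image can be worked through in plain Python. The function `diagnosis_weight` is an illustrative helper, not part of the notebook. The first call uses image 394, where 13 of 15 doctors reported pathology (86.67% in the confidence table); its rating-weighted vote is assumed to be 1, which is consistent with the normalized weight of 92.0 shown for that image.

```python
def diagnosis_weight(positive_votes, n_doctors, rating_vote, w_confidence=0.6, w_doctor=0.4):
    """Weighted sum of the share of positive votes and the binary rating-weighted vote."""
    return positive_votes / n_doctors * w_confidence + rating_vote * w_doctor

THRESHOLD = 0.36  # final threshold used in the notebook

# Image 394: 13 of 15 positive votes, rating-weighted vote assumed to be 1
w_394 = diagnosis_weight(13, 15, 1)

# A single positive vote that does not pass the rating threshold stays far below 0.36
w_single = diagnosis_weight(1, 15, 0)

# A positive rating-weighted vote alone contributes 0.4, which already reaches the threshold
w_rating_only = diagnosis_weight(0, 15, 1)

print(round(w_394, 2), round(w_single, 2), round(w_rating_only, 2))
```

One consequence of these particular numbers is that any image with a positive rating-weighted vote is labeled as pathology regardless of the share of votes.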

A 95% confidence interval was also calculated for the mean diagnosis weight: [0.0721, 0.1022]. The mean weight is low, which is consistent with the final diagnosis being 0 for most of the images.

In [ ]:
df = pd.concat([Diagnosis, confidence], axis=1)
df.columns = ['Diagnosis', 'Probability, %']
df['Probability, %'] = df.apply(lambda row: 100 - row['Probability, %'] if row['Diagnosis'] == 0 else row['Probability, %'], axis=1)
df
Out[0]:
     Diagnosis  Probability, %
0            0            92.0
1            0           100.0
2            0            96.0
3            0           100.0
4            0            88.0
...        ...             ...
472          0           100.0
473          0            84.0
474          0           100.0
475          0           100.0
476          0            96.0

477 rows × 2 columns

The dataframe above contains, for each image, the final diagnosis and the percentage probability that this diagnosis is correct.

Let's evaluate the final quality of the markup.

In [ ]:
mean_probability = np.mean(Diagnosis_weights)
print(f"Average probability: {mean_probability}")

# Standard deviation of probabilities
std_deviation = np.std(Diagnosis_weights)
print(f"Standard deviation of probabilities: {std_deviation}")

# Maximum and minimum probability
max_probability = np.max(Diagnosis_weights)
min_probability = np.min(Diagnosis_weights)
print(f"Maximum probability: {max_probability}")
print(f"Minimum probability: {min_probability}")
Average probability: 0.08712788259958072
Standard deviation of probabilities: 0.16725534527288208
Maximum probability: 1.0
Minimum probability: 0.0

As you can see, most of the diagnoses are negative, that is, there is no pathology.

In [ ]:
accuracy_d = accuracy_score(diagnosis_major, Diagnosis)
recall_d = recall_score(diagnosis_major, Diagnosis)
balanced_accuracy_d = balanced_accuracy_score(diagnosis_major, Diagnosis)
F1_d = f1_score(diagnosis_major, Diagnosis)
fbeta_d = fbeta_score(diagnosis_major, Diagnosis, beta=0.5)
roc_auc_d = roc_auc_score(diagnosis_major, Diagnosis)
print(f"Accuracy: {accuracy_d}")
print(f"Recall: {recall_d}")
print(f"Balanced Accuracy: {balanced_accuracy_d}")
print(f"F1: {F1_d}")
print(f"Fbeta: {fbeta_d}")
print(f"roc-auc: {roc_auc_d}")
Accuracy: 0.9622641509433962
Recall: 0.6206896551724138
Balanced Accuracy: 0.8025323275862069
F1: 0.6666666666666666
Fbeta: 0.6976744186046512
roc-auc: 0.8025323275862069

The high accuracy (96.2%) indicates that most of the final diagnoses match the reference labels.

Recall (62.1%) indicates that some positive cases are missed.

Balanced accuracy (80.3%) and ROC-AUC (80.3%) indicate a good ability to discriminate between the classes. The two values coincide because ROC-AUC is computed here from binary labels rather than from scores.

The F1-score (66.7%) and Fbeta-score (69.8%) indicate a reasonably balanced quality of the final diagnoses.
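
The metrics above can be reproduced from four confusion-matrix counts. The counts in this sketch (18 true positives, 7 false positives, 11 false negatives, 441 true negatives) are an assumption: they are not printed in the notebook but were inferred from the reported values and from Diagnosis.sum() = 25, so treat them as a consistency check rather than the notebook's output.

```python
# Confusion counts inferred from the reported metrics (assumption, see the text above)
tp, fp, fn, tn = 18, 7, 11, 441

accuracy = (tp + tn) / (tp + fp + fn + tn)
recall = tp / (tp + fn)            # share of reference positives that were found
specificity = tn / (tn + fp)       # share of reference negatives that were kept negative
precision = tp / (tp + fp)         # share of positive diagnoses that are correct
balanced = (recall + specificity) / 2

def fbeta(p, r, beta):
    """F-beta score: beta < 1 weights precision more, beta > 1 weights recall more."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

f1 = fbeta(precision, recall, 1.0)
f05 = fbeta(precision, recall, 0.5)
print(accuracy, recall, balanced, f1, f05)
```

With beta = 0.5 the score leans toward precision, which is why Fbeta is slightly higher than F1 here.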

In [ ]:
confusion_matrix_ = confusion_matrix(diagnosis_major, Diagnosis)
disp = ConfusionMatrixDisplay(confusion_matrix_)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
Out[0]:
Text(0.5, 1.0, 'Confusion Matrix')

The confusion matrix shows that there are few misdiagnoses.

In [ ]:
precision, recall, _ = precision_recall_curve(diagnosis_major, Diagnosis)
average_precision = average_precision_score(diagnosis_major, Diagnosis)

plt.figure()
plt.step(recall, precision, where='post', color='b', alpha=0.2, linestyle='-', linewidth=2, label='Precision-Recall curve')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
plt.legend(loc="lower left")
Out[0]:
<matplotlib.legend.Legend at 0x7f8fa9817790>

As the graph above shows, precision is somewhat higher than recall. However, precision is noticeably lower than accuracy, which indicates that some of the positive diagnoses are incorrect.

As the assessment above shows, the markup is of fairly good quality, with small gaps. However, the reference labels (the targets), obtained from the majority vote of the doctors, may not accurately reflect reality, because some doctors have a low rating and even the aggregated opinion may be inaccurate. At the same time, the doctors do show some consistency, although it is low, which supports the correctness of the final markup.

Conclusions

In this example, the analysis of doctors' conclusions was carried out using data analytics methods.