Engee documentation
Notebook

Analysis of medical reports

In this demo, medical data will be analyzed using analytics tools.

During the analysis of the dataset, several metrics were calculated for the consistency and accuracy of doctors' work.

Installing the libraries

Specify the path to the project manually to switch to the working directory, then install the required dependencies.

In [ ]:
%cd /user/Demo_public/biomedical/analis_medical_result
In [ ]:
!pip install -r /user/Demo_public/biomedical/analis_medical_result/requirements.txt
In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import fleiss_kappa
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix, balanced_accuracy_score, mean_squared_error
from sklearn.metrics import precision_recall_curve, average_precision_score, f1_score, fbeta_score, roc_auc_score,ConfusionMatrixDisplay, roc_curve, auc
from sklearn.preprocessing import MinMaxScaler
from scipy.stats import sem, t

Let's read the data and look at the first 5 rows of the dataframe.

In [ ]:
focus = pd.read_csv('/user/Demo_public/biomedical/analis_medical_result/Дата.csv')
In [ ]:
focus.head()
Out[0]:
ID Файла Врач№1 Врач№2 Врач№3 Врач№4 Врач№5 Врач№6 Врач№7 Врач№8 Врач№9 Врач№10 Врач№11 Врач№12 Врач№13 Врач№14 Врач№15
0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
3 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 4 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1
In [ ]:
focus = focus.drop('ID Файла', axis=1)

The file contains the image markup. Each of the 15 doctors labeled each of the 477 images as showing pathology (1) or no pathology (0).

From this you can estimate how many doctors see pathology in each image and make a preliminary judgment about where pathology is really present. You can also look for strong discrepancies between the doctors' results and determine how much the doctors agree with each other.

We will evaluate the quality of each doctor's pathology markup.

In [ ]:
focus_corr = focus.corr(method='spearman')
filtered = focus_corr.where(focus_corr > 0.5).abs()

As the correlation table shows, few pairs of doctors have a correlation above 0.5 in their decisions about pathology. Overall, the correlation is weak.

However, two pairs of doctors (3 and 13, 12 and 10) show moderate correlation. These pairs tend to give the same marking for most of the images, which may indicate a fairly high level of consistency between them regarding whether an image shows pathology.

In [ ]:
focus_corr_ = focus.corr(method='spearman')
filtered_ = focus_corr_.where(focus_corr_ < 0.1)

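There are also pairs of doctors whose markings are almost uncorrelated (coefficient below 0.1), meaning their decisions are close to independent of each other. A coefficient near zero indicates no relationship, not opposite answers, which would require a strongly negative coefficient.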

In [ ]:
focus_corr_mean = focus_corr.mean(axis=1).sort_values(ascending=False)
focus_corr_mean
Out[0]:
Врач№1     0.425707
Врач№3     0.394424
Врач№12    0.392694
Врач№10    0.384297
Врач№9     0.384255
Врач№13    0.376615
Врач№14    0.336404
Врач№5     0.328782
Врач№4     0.322743
Врач№11    0.317238
Врач№15    0.316465
Врач№6     0.312953
Врач№2     0.307039
Врач№7     0.297679
Врач№8     0.143105
dtype: float64

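The table above shows each doctor's average correlation with the other doctors, which indicates how closely each doctor's opinion agrees with the rest. The average includes each doctor's correlation with themselves (1.0), so the values are slightly higher than the average over the other 14 doctors alone.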

In [ ]:
plt.figure(figsize=(8,8))
sns.heatmap(focus_corr.round(2), annot=True, center=0, cmap='inferno')
plt.title('Корреляция между разметками врачей')
Out[0]:
Text(0.5, 1.0, 'Корреляция между разметками врачей')
In [ ]:
plt.figure(figsize=(8,8))
sns.heatmap(filtered.round(2), annot=True, center=0, cmap='inferno')
plt.title('Корреляция между разметками врачей с средней по силе корреляции')
Out[0]:
Text(0.5, 1.0, 'Корреляция между разметками врачей с средней по силе корреляции')

Let's see what percentage of doctors marked pathology for each image.

In [ ]:
confidence = focus.sum(axis=1).sort_values(ascending=False) / focus.shape[1] * 100
confidence.head(20)
Out[0]:
433    100.000000
394     86.666667
28      80.000000
346     80.000000
352     80.000000
359     73.333333
182     73.333333
134     73.333333
113     73.333333
383     73.333333
398     73.333333
153     73.333333
295     66.666667
120     66.666667
342     60.000000
103     60.000000
341     53.333333
387     53.333333
339     53.333333
192     53.333333
dtype: float64

The table above shows, for each image ID, the percentage of doctors who marked pathology. This can serve as a rough estimate of how likely the image is to show pathology.

In [ ]:
average_marking = focus.mean(axis=1)
In [ ]:
average_doctor_deviation = focus.sub(average_marking, axis=0).abs().mean().sort_values(ascending=False)
average_doctor_deviation
Out[0]:
Врач№15    0.245702
Врач№2     0.211880
Врач№6     0.168553
Врач№7     0.153878
Врач№8     0.145073
Врач№4     0.144095
Врач№12    0.140741
Врач№14    0.124948
Врач№11    0.122991
Врач№10    0.121454
Врач№1     0.117540
Врач№5     0.116841
Врач№9     0.116143
Врач№13    0.110552
Врач№3     0.106918
dtype: float64

Doctor No. 15 has the highest average deviation (0.2457), which means that his markings differ significantly from the average markings for all doctors. This may indicate possible errors in the markup.

Doctor No. 3 has the lowest average deviation (0.1069), which indicates a high consistency of his markings with the average markings for all doctors. This doctor most often marked up the pictures in the same way as most other doctors.

Thus, the average deviation from the average marking for all doctors was obtained here.
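To make the calculation concrete, here is a minimal sketch on a made-up table of four images and three doctors. The values and the name `toy_markup` are hypothetical and not taken from the dataset. It repeats the same steps: per-image mean, absolute deviation from it, and the average per doctor.

```python
import numpy as np

# Hypothetical markup: rows are images, columns are doctors A, B, C (1 = pathology)
toy_markup = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
])

# Average marking of each image across doctors
image_mean = toy_markup.mean(axis=1, keepdims=True)

# Mean absolute deviation of each doctor from the per-image average
mean_abs_dev = np.abs(toy_markup - image_mean).mean(axis=0)
print(mean_abs_dev)
```

In this toy table doctor B deviates least, because B sides with the majority on every image on which the doctors disagree.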

Next, we calculate the covariance matrix for the doctors' assessments.

In [ ]:
covariance_matrix = focus.cov()
covariance_matrix
Out[0]:
Врач№1 Врач№2 Врач№3 Врач№4 Врач№5 Врач№6 Врач№7 Врач№8 Врач№9 Врач№10 Врач№11 Врач№12 Врач№13 Врач№14 Врач№15
Врач№1 0.094031 0.046544 0.032090 0.029244 0.031869 0.054032 0.030905 0.004118 0.033649 0.046152 0.026346 0.047610 0.033750 0.041832 0.048782
Врач№2 0.046544 0.181157 0.022775 0.038361 0.020176 0.038414 0.016357 0.008676 0.025206 0.047540 0.023492 0.061801 0.028082 0.033111 0.060079
Врач№3 0.032090 0.022775 0.051645 0.022770 0.020017 0.022466 0.026743 0.007183 0.029262 0.021814 0.021431 0.027699 0.030407 0.027047 0.027972
Врач№4 0.029244 0.038361 0.022770 0.107033 0.018313 0.038643 0.020387 0.011500 0.030209 0.040260 0.016780 0.049346 0.020158 0.015248 0.039612
Врач№5 0.031869 0.020176 0.020017 0.018313 0.053512 0.017917 0.020176 0.004955 0.026994 0.023704 0.012883 0.021088 0.019780 0.020572 0.016846
Врач№6 0.054032 0.038414 0.022466 0.038643 0.017917 0.138479 0.016939 -0.003788 0.030896 0.040022 0.020031 0.047073 0.023871 0.026346 0.055983
Врач№7 0.030905 0.016357 0.026743 0.020387 0.020176 0.016939 0.110195 0.015446 0.025673 0.023030 0.033296 0.034023 0.017811 0.023307 0.034168
Врач№8 0.004118 0.008676 0.007183 0.011500 0.004955 -0.003788 0.015446 0.057220 0.003550 0.000172 0.008390 0.009967 0.004827 0.001321 0.007201
Врач№9 0.033649 0.025206 0.029262 0.030209 0.026994 0.030896 0.025673 0.003550 0.073472 0.036084 0.021788 0.038705 0.026826 0.027086 0.037326
Врач№10 0.046152 0.047540 0.021814 0.040260 0.023704 0.040022 0.023030 0.000172 0.036084 0.090693 0.016133 0.058731 0.023492 0.025369 0.054225
Врач№11 0.026346 0.023492 0.021431 0.016780 0.012883 0.020031 0.033296 0.008390 0.021788 0.016133 0.064531 0.025540 0.023241 0.019542 0.021524
Врач№12 0.047610 0.061801 0.027699 0.049346 0.021088 0.047073 0.034023 0.009967 0.038705 0.058731 0.025540 0.125478 0.029183 0.023691 0.067874
Врач№13 0.033750 0.028082 0.030407 0.020158 0.019780 0.023871 0.017811 0.004827 0.026826 0.023492 0.023241 0.029183 0.055371 0.024602 0.026730
Врач№14 0.041832 0.033111 0.027047 0.015248 0.020572 0.026346 0.023307 0.001321 0.027086 0.025369 0.019542 0.023691 0.024602 0.075234 0.028302
Врач№15 0.048782 0.060079 0.027972 0.039612 0.016846 0.055983 0.034168 0.007201 0.037326 0.054225 0.021524 0.067874 0.026730 0.028302 0.208657
In [ ]:
covariance_matrix_mean = covariance_matrix.mean(axis=1).sort_values(ascending=False)
covariance_matrix_mean
Out[0]:
Врач№15    0.049019
Врач№12    0.044520
Врач№2     0.043451
Врач№1     0.040064
Врач№6     0.037822
Врач№10    0.036495
Врач№4     0.033191
Врач№9     0.031115
Врач№7     0.029897
Врач№14    0.027507
Врач№3     0.026088
Врач№13    0.025875
Врач№11    0.023663
Врач№5     0.021920
Врач№8     0.009383
dtype: float64

Doctor No. 15 has the highest average covariance (0.0490). This suggests that the markup of Doctor No. 15 tends to vary together with the markup of other doctors. Note, however, that covariance also grows with a doctor's own variance, and Doctor No. 15 has the largest variance of all doctors (0.2087 on the diagonal of the matrix).

Doctor No. 8 has the lowest average covariance (0.0094). This indicates that the markup of Doctor No. 8 is, on average, the least consistent with the markup of other doctors: this doctor most often evaluates the presence or absence of pathology differently from the others.

Doctor No. 15 has both the highest average covariance and the highest average deviation. A possible explanation is that this doctor generally agrees with the majority about pathology but notices subtle features more often than others and tends to overestimate them.

In [ ]:
doctor_13 = focus['Врач№13']
doctor_3 = focus['Врач№3']
doctor_12 = focus['Врач№12']
doctor_10 = focus['Врач№10']
In [ ]:
cohen_kappa_score(doctor_13, doctor_3), cohen_kappa_score(doctor_12, doctor_10)
Out[0]:
(np.float64(0.5681836885853017), np.float64(0.5380704515191865))

This agrees with the correlation table: these are the two most consistent pairs of doctors in their assessments of the images. Kappa values of about 0.57 and 0.54 correspond to moderate agreement.
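For reference, Cohen's kappa for two raters is (p_o - p_e) / (1 - p_e), where p_o is the observed share of agreement and p_e is the agreement expected by chance. The sketch below computes it by hand for binary labels. The function name `cohen_kappa_binary` and the two label lists are illustrative assumptions, not part of the notebook.

```python
def cohen_kappa_binary(a, b):
    """Cohen's kappa for two equally long sequences of 0/1 labels."""
    n = len(a)
    # Observed agreement: share of items on which the two raters agree
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement, from each rater's share of positive labels
    pa, pb = sum(a) / n, sum(b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)
    # Undefined when p_e == 1 (both raters always give the same single label)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels of two raters for ten images
rater_a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
rater_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa_binary(rater_a, rater_b)
print(kappa)
```

Here the raters agree on 8 of 10 images, yet kappa is only 0.375, because with so few positive labels most of that agreement would be expected by chance.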

In [ ]:
counts = np.zeros((focus.shape[0], 2), dtype=int)
for idx, row in focus.iterrows():
  counts[idx, 0] = (row == 0).sum()
  counts[idx, 1] = (row == 1).sum()
In [ ]:
fleiss_kappa(counts)
Out[0]:
np.float64(0.2591132582884047)

This test measures the agreement among the annotations of all doctors at once. The result (about 0.26) shows that agreement is low: the doctors disagree about the presence of pathology.
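The following sketch shows how Fleiss' kappa can be computed directly from a table of per-image vote counts, in the same layout as `counts` (one row per image, one column per category). The function name `fleiss_kappa_from_counts` and the toy table are assumptions made for illustration, and the sketch assumes every image is rated by the same number of raters.

```python
import numpy as np

def fleiss_kappa_from_counts(table):
    """Fleiss' kappa from an (items x categories) table of vote counts."""
    table = np.asarray(table, dtype=float)
    n_items = table.shape[0]
    n_raters = table[0].sum()  # assumed identical for every item
    # Agreement within each item: share of rater pairs that agree
    p_i = (np.square(table).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Overall share of votes in each category
    p_j = table.sum(axis=0) / (n_items * n_raters)
    # Agreement expected by chance
    p_e = np.square(p_j).sum()
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical counts for 4 images and 5 raters: [votes for 0, votes for 1]
toy_counts = [[5, 0], [4, 1], [3, 2], [0, 5]]
toy_kappa = fleiss_kappa_from_counts(toy_counts)
print(toy_kappa)
```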

In [ ]:
# Square table of pairwise Cohen's kappa values, indexed by doctor on both axes
kappa_scores = pd.DataFrame(index=focus.columns, columns=focus.columns, dtype=float)
In [ ]:
for col in focus.columns:
  for col_ in focus.columns:
    kappa_scores.loc[col, col_] = cohen_kappa_score(focus[col], focus[col_])

This table shows which doctors agree with each other more.

In [ ]:
kappa_scores_mean = kappa_scores.mean(1).sort_values(ascending=False)
kappa_scores_mean
Out[0]:
Врач№1     0.411056
Врач№12    0.374667
Врач№10    0.370726
Врач№9     0.370257
Врач№3     0.369324
Врач№13    0.355202
Врач№14    0.324713
Врач№4     0.311602
Врач№5     0.310791
Врач№11    0.304169
Врач№6     0.296735
Врач№7     0.286314
Врач№2     0.275085
Врач№15    0.270183
Врач№8     0.137958
dtype: float64

The table shows that Doctor No. 15 has one of the lowest average agreement scores among the doctors.

Cohen's kappa, unlike the correlation coefficient, corrects for agreement that occurs by chance.

This may be because the doctor often agrees with other doctors but makes noticeably more positive or negative diagnoses than they do.

Let's calculate the mean squared deviation of each doctor's results from the per-image average across doctors.

In [ ]:
standard_deviation_mean=focus.sub(focus.mean(axis=1), axis=0).pow(2).mean().sort_values(ascending=False)
standard_deviation_mean
Out[0]:
Врач№15    0.174125
Врач№2     0.140303
Врач№6     0.096976
Врач№7     0.082301
Врач№8     0.073496
Врач№4     0.072518
Врач№12    0.069164
Врач№14    0.053371
Врач№11    0.051414
Врач№10    0.049877
Врач№1     0.045963
Врач№5     0.045264
Врач№9     0.044566
Врач№13    0.038975
Врач№3     0.035341
dtype: float64

Here the mean squared error (MSE) relative to the doctors' average shows which doctors deviate the most, that is, who most often gives assessments of a particular image that are inconsistent with those of other doctors.

In [ ]:
std_mean=focus.std().sort_values(ascending=False)
std_mean
Out[0]:
Врач№15    0.456790
Врач№2     0.425625
Врач№6     0.372128
Врач№12    0.354229
Врач№7     0.331956
Врач№4     0.327159
Врач№1     0.306645
Врач№10    0.301153
Врач№14    0.274288
Врач№9     0.271057
Врач№11    0.254030
Врач№8     0.239208
Врач№13    0.235310
Врач№5     0.231327
Врач№3     0.227254
dtype: float64

The standard deviation shows which doctors have the highest variation in their own markings.

Next, we obtain a "reference score" for each image based on the majority vote: if most of the doctors marked pathology, we set 1; otherwise, 0.

In [ ]:
diagnosis_major = pd.Series([0 if x > y else 1 for x, y in zip(counts[:, 0], counts[:, 1])])
In [ ]:
accuracy_df = focus.apply(lambda x: accuracy_score(diagnosis_major, x)).sort_values(ascending=False)
accuracy_df
Out[0]:
Врач№3     0.955975
Врач№5     0.949686
Врач№13    0.947589
Врач№9     0.943396
Врач№1     0.943396
Врач№14    0.932914
Врач№11    0.928721
Врач№10    0.926625
Врач№12    0.893082
Врач№4     0.888889
Врач№7     0.884696
Врач№8     0.882600
Врач№6     0.870021
Врач№2     0.807128
Врач№15    0.761006
dtype: float64

The table above shows the accuracy of each doctor's assessments compared with the reference labels obtained by majority vote: the more doctors confirm a diagnosis for a particular image, the higher the probability that the image really shows pathology.
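As a small illustration of the majority reference and the accuracy measured against it, the sketch below uses a hypothetical table of four images and three doctors. The names `toy_markup`, `reference`, and `accuracy` are not from the notebook.

```python
import numpy as np

# Hypothetical markup: rows are images, columns are doctors A, B, C
toy_markup = np.array([
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
])
n_doctors = toy_markup.shape[1]

# Majority label: 1 when more than half of the doctors marked pathology
reference = (toy_markup.sum(axis=1) > n_doctors / 2).astype(int)

# Share of images on which each doctor matches the majority label
accuracy = (toy_markup == reference[:, None]).mean(axis=0)
print(reference, accuracy)
```

Each doctor's own vote is part of the majority, so accuracy measured this way is somewhat optimistic, and more so when the number of doctors is small.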

In [ ]:
recall_df = focus.apply(lambda x: recall_score(diagnosis_major, x)).sort_values(ascending=False)
recall_df
Out[0]:
Врач№15    0.965517
Врач№1     0.896552
Врач№2     0.862069
Врач№12    0.827586
Врач№6     0.793103
Врач№10    0.724138
Врач№9     0.689655
Врач№14    0.620690
Врач№7     0.586207
Врач№4     0.586207
Врач№3     0.586207
Врач№5     0.551724
Врач№13    0.551724
Врач№11    0.482759
Врач№8     0.034483
dtype: float64

Recall can be used to assess how often the doctor correctly identifies the presence of pathologies in the images.

In [ ]:
columns = focus.columns
confusion_matrices = {}
for column in columns:
  confusion_matrix_ = confusion_matrix(diagnosis_major, focus[column])
  confusion_matrices[column] = confusion_matrix_.sum() - np.diag(confusion_matrix_).sum()
In [ ]:
errors_df = pd.DataFrame(list(confusion_matrices.items()), columns=['Врач', 'Количество ошибок']).sort_values(ascending=False, by='Количество ошибок')
errors_df
Out[0]:
Врач Количество ошибок
14 Врач№15 114
1 Врач№2 92
5 Врач№6 62
7 Врач№8 56
6 Врач№7 55
3 Врач№4 53
11 Врач№12 51
9 Врач№10 35
10 Врач№11 34
13 Врач№14 32
0 Врач№1 27
8 Врач№9 27
12 Врач№13 25
4 Врач№5 24
2 Врач№3 21

By counting FP + FN, we can see which of the doctors has the most errors.

In [ ]:
counts[:, 0].sum(), counts[:, 1].sum()
Out[0]:
(np.int64(6316), np.int64(839))

(6316, 839) is the total number of doctors' votes: 6316 votes for the absence of pathology and 839 for its presence. The classes are clearly imbalanced (roughly 7.5 to 1), so let's try balanced_accuracy_score from sklearn.

In [ ]:
balanced_df = focus.apply(lambda x: balanced_accuracy_score(diagnosis_major, x)).sort_values(ascending=False)
balanced_df
Out[0]:
Врач№1     0.921490
Врач№12    0.862454
Врач№15    0.856643
Врач№6     0.834052
Врач№2     0.832820
Врач№10    0.831935
Врач№9     0.824738
Врач№14    0.786907
Врач№3     0.783059
Врач№5     0.763585
Врач№13    0.762469
Врач№4     0.747345
Врач№7     0.745112
Врач№11    0.720174
Врач№8     0.485991
dtype: float64
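Balanced accuracy is the average of the recall for each class, so a doctor who almost never marks pathology is no longer rewarded for the large negative class. This is visible for Doctor No. 8, whose accuracy is 0.8826 but whose balanced accuracy is 0.4860. The sketch below reproduces the effect on hypothetical labels; the helper names are illustrative.

```python
def recall_for_class(y_true, y_pred, label):
    """Share of items with true class `label` that were predicted as `label`."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    total = sum(1 for t in y_true if t == label)
    return hits / total

def balanced_accuracy(y_true, y_pred):
    """Mean of the recalls for class 0 and class 1."""
    return (recall_for_class(y_true, y_pred, 0) + recall_for_class(y_true, y_pred, 1)) / 2

# Hypothetical case: nine negatives, one positive, and a rater who always says 0
y_true = [0] * 9 + [1]
y_pred = [0] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, balanced)
```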

Let's build graphs that visually display information about which doctor coped best with the diagnosis. A total of 10 metrics were calculated, and we will build 10 graphs.

In [ ]:
colors = ['blue', 'green', 'red', 'purple', 'orange', 'yellow', 'brown', 'pink', 'gray', 'cyan', 'magenta', 'lime', 'teal', 'navy']

fig, axes = plt.subplots(nrows=5, ncols=2,figsize=(13, 24))
axes = axes.flatten()
axes[0].bar(focus_corr_mean.index.tolist(), focus_corr_mean.values.tolist(), color=colors[0])
axes[0].set_title('Средняя корреляция по врачам')
axes[0].set_xlabel('Врач')
axes[0].set_ylabel('Средняя корреляция')
axes[0].tick_params(axis='x', rotation=45)
axes[1].bar(average_doctor_deviation.index.tolist(), average_doctor_deviation.values.tolist(), color=colors[1])
axes[1].set_title('Среднее отклонение по врачам')
axes[1].set_xlabel('Врач')
axes[1].set_ylabel('Среднее отклонение')
axes[1].tick_params(axis='x', rotation=45)
axes[2].bar(covariance_matrix_mean.index.tolist(), covariance_matrix_mean.values.tolist(), color=colors[2])
axes[2].set_title('Средняя ковариация по врачам')
axes[2].set_xlabel('Врач')
axes[2].set_ylabel('Средняя ковариация')
axes[2].tick_params(axis='x', rotation=45)
axes[3].bar(kappa_scores_mean.index.tolist(), kappa_scores_mean.values.tolist(), color=colors[3])
axes[3].set_title('Средняя оценка по Капо-Коэно по врачам')
axes[3].set_xlabel('Врач')
axes[3].set_ylabel('Средняя оценка по Капо-Коэно')
axes[3].tick_params(axis='x', rotation=45)
axes[4].bar(standard_deviation_mean.index.tolist(), standard_deviation_mean.values.tolist(), color=colors[4])
axes[4].set_title('Среднеквадратическое отклонение по врачам')
axes[4].set_xlabel('Врач')
axes[4].set_ylabel('Среднеквадратическое отклонение')
axes[4].tick_params(axis='x', rotation=45)
axes[5].bar(std_mean.index.tolist(), std_mean.values.tolist(), color=colors[5])
axes[5].set_title('Среднее СКО по врачам')
axes[5].set_xlabel('Врач')
axes[5].set_ylabel('Среднее СКО')
axes[5].tick_params(axis='x', rotation=45)
axes[6].bar(accuracy_df.index.tolist(), accuracy_df.values.tolist(), color=colors[6])
axes[6].set_title('Мажоритарная точность по врачам')
axes[6].set_xlabel('Врач')
axes[6].set_ylabel('Мажоритарная точность')
axes[6].tick_params(axis='x', rotation=45)
axes[7].bar(recall_df.index.tolist(), recall_df.values.tolist(), color=colors[7])
axes[7].set_title('Мажоритарная полнота по врачам')
axes[7].set_xlabel('Врач')
axes[7].set_ylabel('Мажоритарная полнота')
axes[7].tick_params(axis='x', rotation=45)
axes[8].bar(errors_df['Врач'], errors_df['Количество ошибок'], color=colors[8])
axes[8].set_title('Ошибки по врачам')
axes[8].set_xlabel('Врач')
axes[8].set_ylabel('ошибки по полнота')
axes[8].tick_params(axis='x', rotation=45)
axes[9].bar(balanced_df.index.tolist(), balanced_df.values.tolist(), color=colors[9])
axes[9].set_title('Мажоритарная сбалансированная точность по врачам')
axes[9].set_xlabel('Врач')
axes[9].set_ylabel('Мажоритарная сбалансированная точность')
axes[9].tick_params(axis='x', rotation=45)
plt.subplots_adjust(hspace=0.6)
plt.tight_layout()

Let's analyze the graphs. Doctors with a high average correlation most often share the opinion of other doctors about an image. Covariance reflects joint variability and also depends on how much a doctor's own markings vary, so it should be read together with the other metrics. Cohen's kappa helps to assess how well a doctor's decisions align with those of other doctors. Unlike the average correlation, it accounts for agreement by chance.

The average deviation and the mean squared deviation show how far the opinion of a particular doctor is from the average opinion of the doctors, while the standard deviation expresses numerically how much a doctor's own markings vary.

The majority label of each image was calculated, that is, pathology was assessed based on the opinion of the majority. We treat this as the benchmark. Accuracy shows how often a doctor makes the correct diagnosis (absence or presence of pathology), while recall shows how well a doctor finds pathology in cases where it is really present. Balanced accuracy was calculated because the doctors marked the absence of pathology several times more often than its presence, that is, the classes are imbalanced.

In [ ]:
fig, axes = plt.subplots(figsize=(7, 7))
axes.bar(accuracy_df.index.tolist(), accuracy_df.values.tolist(), color=colors[6], alpha=1, label='Мажоритарная точность')
axes.set_title('Мажоритарная точность и сбалансированная точность по врачам')
axes.set_xlabel('Врач')
axes.set_ylabel('Мажоритарная точность, сбалансированная точность')
axes.tick_params(axis='x', rotation=45)
axes.bar(balanced_df.index.tolist(), balanced_df.values.tolist(), color=colors[9], alpha=0.5, label='Мажоритарная сбалансированная точность')
axes.legend()
axes.set_ylim(0, 1.25)
plt.tight_layout()
In [ ]:
plt.figure(figsize=(8, 6))
sns.boxplot(data=confidence, orient='v', color='lightblue')
plt.ylabel('Вероятность наличия патологии (%)')
plt.title('Распределение вероятностей наличия патологии')
plt.tight_layout()
plt.show()

As the graph above shows, most of the images are not unanimously marked as showing pathology. In only one image did all the doctors mark pathology.

Let's try to create a single metric that combines all of the above.

Let's identify the doctors with the weakest markup.

In [ ]:
metrics_df = pd.concat([focus_corr_mean, average_doctor_deviation, covariance_matrix_mean, kappa_scores_mean, standard_deviation_mean, std_mean, accuracy_df, recall_df, balanced_df, errors_df.set_index('Врач')['Количество ошибок']], axis=1)
metrics_df.columns=['focus_corr_mean', 'average_doctor_deviation', 'covariance_matrix_mean', 'kappa_scores_mean', 'standard_deviation_mean', 'std_mean', 'accuracy_df', 'recall_df', 'balanced_df', 'errors_df']
In [ ]:
metrics_df
Out[0]:
focus_corr_mean average_doctor_deviation covariance_matrix_mean kappa_scores_mean standard_deviation_mean std_mean accuracy_df recall_df balanced_df errors_df
Врач№1 0.425707 0.117540 0.040064 0.411056 0.045963 0.306645 0.943396 0.896552 0.921490 27
Врач№3 0.394424 0.106918 0.026088 0.369324 0.035341 0.227254 0.955975 0.586207 0.783059 21
Врач№12 0.392694 0.140741 0.044520 0.374667 0.069164 0.354229 0.893082 0.827586 0.862454 51
Врач№10 0.384297 0.121454 0.036495 0.370726 0.049877 0.301153 0.926625 0.724138 0.831935 35
Врач№9 0.384255 0.116143 0.031115 0.370257 0.044566 0.271057 0.943396 0.689655 0.824738 27
Врач№13 0.376615 0.110552 0.025875 0.355202 0.038975 0.235310 0.947589 0.551724 0.762469 25
Врач№14 0.336404 0.124948 0.027507 0.324713 0.053371 0.274288 0.932914 0.620690 0.786907 32
Врач№5 0.328782 0.116841 0.021920 0.310791 0.045264 0.231327 0.949686 0.551724 0.763585 24
Врач№4 0.322743 0.144095 0.033191 0.311602 0.072518 0.327159 0.888889 0.586207 0.747345 53
Врач№11 0.317238 0.122991 0.023663 0.304169 0.051414 0.254030 0.928721 0.482759 0.720174 34
Врач№15 0.316465 0.245702 0.049019 0.270183 0.174125 0.456790 0.761006 0.965517 0.856643 114
Врач№6 0.312953 0.168553 0.037822 0.296735 0.096976 0.372128 0.870021 0.793103 0.834052 62
Врач№2 0.307039 0.211880 0.043451 0.275085 0.140303 0.425625 0.807128 0.862069 0.832820 92
Врач№7 0.297679 0.153878 0.029897 0.286314 0.082301 0.331956 0.884696 0.586207 0.745112 55
Врач№8 0.143105 0.145073 0.009383 0.137958 0.073496 0.239208 0.882600 0.034483 0.485991 56

Let's normalize the metrics to the range from 0 to 1.

In [ ]:
scaler = MinMaxScaler()
normalized_df = pd.DataFrame(scaler.fit_transform(metrics_df), index=metrics_df.index, columns=metrics_df.columns)
In [ ]:
normalized_df
Out[0]:
focus_corr_mean average_doctor_deviation covariance_matrix_mean kappa_scores_mean standard_deviation_mean std_mean accuracy_df recall_df balanced_df errors_df
Врач№1 1.000000 0.076536 0.774068 1.000000 0.076536 0.345876 0.935484 0.925926 1.000000 0.064516
Врач№3 0.889305 0.000000 0.421469 0.847191 0.000000 0.000000 1.000000 0.592593 0.682131 0.000000
Врач№12 0.883183 0.243706 0.886512 0.866755 0.243706 0.553179 0.677419 0.851852 0.864440 0.322581
Врач№10 0.853470 0.104733 0.684026 0.852324 0.104733 0.321947 0.849462 0.740741 0.794362 0.150538
Врач№9 0.853321 0.066465 0.548299 0.850607 0.066465 0.190834 0.935484 0.703704 0.777837 0.064516
Врач№13 0.826287 0.026183 0.416106 0.795481 0.026183 0.035093 0.956989 0.555556 0.634853 0.043011
Врач№14 0.683997 0.129909 0.457279 0.683842 0.129909 0.204907 0.881720 0.629630 0.690969 0.118280
Врач№5 0.657025 0.071501 0.316315 0.632860 0.071501 0.017741 0.967742 0.555556 0.637416 0.032258
Врач№4 0.635656 0.267875 0.600673 0.635830 0.267875 0.435245 0.655914 0.592593 0.600124 0.344086
Врач№11 0.616178 0.115811 0.360295 0.608615 0.115811 0.116653 0.860215 0.481481 0.537734 0.139785
Врач№15 0.613442 1.000000 1.000000 0.484167 1.000000 1.000000 0.000000 1.000000 0.851096 1.000000
Врач№6 0.601015 0.444109 0.717502 0.581393 0.444109 0.631160 0.559140 0.814815 0.799222 0.440860
Врач№2 0.580087 0.756294 0.859540 0.502117 0.756294 0.864227 0.236559 0.888889 0.796394 0.763441
Врач№7 0.546967 0.338369 0.517571 0.543235 0.338369 0.456147 0.634409 0.592593 0.594998 0.365591
Врач№8 0.000000 0.274924 0.000000 0.000000 0.274924 0.052077 0.623656 0.000000 0.000000 0.376344
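Min-max scaling maps each metric to [0, 1] with (x - min) / (max - min), computed per column. The sketch below checks this by hand for the error counts of Doctors No. 1, 3, and 15 from the table above. Doctors No. 3 and No. 15 hold the smallest and largest counts (21 and 114), so these three values are enough to reproduce the scale. The helper `min_max_scale` is an illustrative assumption, not the MinMaxScaler implementation.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Error counts of Doctors No. 1, No. 3, No. 15 (from the metrics table)
errors = [27, 21, 114]
scaled_errors = min_max_scale(errors)
print(scaled_errors)
```

The first value, (27 - 21) / (114 - 21), is about 0.0645, which matches the `errors_df` entry for Doctor No. 1 in the normalized table.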

Let's set a weighting factor for each of the metrics

In [ ]:
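weights = {
    'kappa_scores_mean': 0.3,
    'covariance_matrix_mean': -0.2,
    'average_doctor_deviation': -0.25,
    'focus_corr_mean': 0.25,
    # NOTE: the next three keys do not match the column names of normalized_df
    # ('balanced_df', 'recall_df', 'accuracy_df'). pandas aligns on labels, so
    # these weights produce NaN columns that sum() skips, and the scores shown
    # below do not include them.
    'balanced_accuracy': 0.25,
    'recall': 0.25,
    'accuracy': 0.25,
    'std_mean': -0.15,
    'standard_deviation_mean': -0.10,
    'errors_df': -0.2
}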

Let's calculate the combined metric as the weighted sum of the normalized metrics: each metric is multiplied by its weighting coefficient and the products are summed.

In [ ]:
normalized_df['final_score'] = (normalized_df * pd.Series(weights)).sum(axis=1)
In [ ]:
normalized_df['final_score'].sort_values(ascending=False)
Out[0]:
Врач№3     0.392190
Врач№13    0.338965
Врач№1     0.303614
Врач№9     0.294062
Врач№5     0.256713
Врач№10    0.217204
Врач№14    0.184836
Врач№11    0.178581
Врач№12    0.070730
Врач№4     0.001668
Врач№7    -0.063771
Врач№6    -0.157113
Врач№8    -0.179304
Врач№2    -0.423276
Врач№15   -0.601389
Name: final_score, dtype: float64

Thus, based on existing metrics, we wrote our own metric, calculated it, and received a "rating" of doctors.
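One detail worth checking in a weighted sum like this is label alignment. When a DataFrame is multiplied by a Series, pandas matches the Series labels to the column names, and any label without a match yields a NaN column that `sum` skips by default. In the weights dictionary above, the keys 'balanced_accuracy', 'recall', and 'accuracy' differ from the columns 'balanced_df', 'recall_df', and 'accuracy_df', so those three metrics appear not to contribute to the reported scores. The sketch below shows the effect on a hypothetical two-doctor table; all names and values in it are made up.

```python
import pandas as pd

# Hypothetical normalized metrics for two doctors
scores = pd.DataFrame(
    {"kappa": [1.0, 0.5], "recall_df": [0.9, 0.2]},
    index=["doctor_a", "doctor_b"],
)

mismatched = pd.Series({"kappa": 0.3, "recall": 0.25})    # 'recall' is not a column
matched = pd.Series({"kappa": 0.3, "recall_df": 0.25})    # keys equal column names

# Unmatched labels become all-NaN columns, which sum() ignores
score_mismatched = (scores * mismatched).sum(axis=1)
score_matched = (scores * matched).sum(axis=1)

# A simple guard: list the weight keys that have no matching column
unused_keys = set(mismatched.index) - set(scores.columns)
print(score_mismatched, score_matched, unused_keys)
```

A check like `unused_keys` before computing the score makes such a mismatch visible instead of silently changing the result.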

In [ ]:
plt.figure(figsize=(10, 6))
ranking = normalized_df['final_score'].sort_values(ascending=False)
# Assign hue explicitly: passing palette without hue is deprecated in seaborn
sns.barplot(x=ranking.values, y=ranking.index, hue=ranking.index, palette='coolwarm', legend=False)
plt.title('Рейтинг врачей')
plt.xlabel('Рейтинг')
plt.ylabel('Врач')
plt.show()

As the graph above shows, the lowest-rated doctors are No. 15, 2, 8, 6, and 7.

Let's form the final markup and calculate the doctors' confidence.

One option is to rely on the overall vote: for example, if 14 out of 15 doctors put 1, then pathology is present. However, the doctors' ratings can also be taken into account.

Let's calculate the final diagnoses.

In [ ]:
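# Weight each doctor's vote by that doctor's rating (columns align on doctor names)
f = focus * normalized_df['final_score']
# Rating-weighted vote: 1 if the summed ratings of the doctors who marked pathology exceed 0.4
final_votes = (f.sum(1) > 0.4).astype(int)
# Number of doctors who marked pathology on each image
confidence__ = focus.sum(axis=1)
weights_ = {
    'doctor' : 0.4,
    'confidence': 0.6
}

# Combine the share of positive votes with the rating-weighted vote
Diagnosis_weights = ((confidence__ / focus.shape[1]) * weights_['confidence'] + final_votes * weights_['doctor'])
Diagnosis = (Diagnosis_weights >= 0.36).astype(int)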
In [ ]:
Diagnosis.sum()
Out[0]:
np.int64(25)

Let's calculate the doctors' confidence

In [ ]:
normalized_diagnosis_weights = (Diagnosis_weights - np.min(Diagnosis_weights)) / (np.max(Diagnosis_weights) - np.min(Diagnosis_weights))
confidence = normalized_diagnosis_weights * 100
confidence.sort_values(ascending=False)
Out[0]:
433    100.0
394     92.0
28      88.0
346     88.0
352     88.0
       ...  
472      0.0
3        0.0
474      0.0
475      0.0
1        0.0
Length: 477, dtype: float64

Let's calculate the confidence interval

In [ ]:
mean = np.mean(Diagnosis_weights)
std_err = sem(Diagnosis_weights)

confidence_level = 0.95
n = len(Diagnosis_weights)
h = std_err * t.ppf((1 + confidence_level) / 2, n - 1)

lower_bound = mean - h
upper_bound = mean + h

# Print the confidence interval
print(f"Доверительный интервал: [{lower_bound}, {upper_bound}]")
Доверительный интервал: [0.07206422989620075, 0.1021915353029607]

Let's analyze the results.

The Series Diagnosis contains the final label for each image. The diagnosis was calculated from the doctors' ratings together with the probability of pathology, which for a particular image was taken as the average of the doctors' marks on that image.

The final weight of pathology in an image (Diagnosis_weights) was calculated as a weighted sum of these two components. The rating-based vote has a weight of 0.4 and the average-based probability has a weight of 0.6. The rating-based component is weighted lower because otherwise, if 14 out of 15 doctors confirm that there is no pathology and one highly rated doctor (with a rating of, for example, 0.4) says that there is, the weighted sum would be high and the result would be a false positive.

normalized_diagnosis_weights is the certainty that an image shows pathology, obtained by min-max scaling of the weighted sum.

The rating-weighted vote uses a threshold of 0.4, and the final label is 1 if the weighted sum is at least 0.36. As a result, 25 images were labeled as showing pathology.
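The arithmetic of the weighted sum can be traced on single images. The sketch below is a simplified restatement of the formula in the code cell above, with hypothetical helper names `diagnosis_weight` and `final_label`. The first case corresponds to image 394, where 13 of 15 doctors marked pathology. The rating-weighted vote is assumed to be 1 there, which reproduces the reported value of 92.0. The second case is a made-up image.

```python
def diagnosis_weight(n_positive, n_doctors, rating_vote,
                     w_confidence=0.6, w_doctor=0.4):
    """Weighted sum of the share of positive votes and the rating-weighted vote (0 or 1)."""
    share_positive = n_positive / n_doctors
    return share_positive * w_confidence + rating_vote * w_doctor

def final_label(weight, threshold=0.36):
    """Final diagnosis: 1 if the weighted sum reaches the threshold."""
    return int(weight >= threshold)

# Image 394: 13 of 15 positive votes, rating-weighted vote assumed to be 1
weight_strong = diagnosis_weight(13, 15, rating_vote=1)

# Hypothetical image: 8 of 15 positive votes, rating-weighted vote 0
weight_weak = diagnosis_weight(8, 15, rating_vote=0)

print(weight_strong, final_label(weight_strong))
print(weight_weak, final_label(weight_weak))
```

The second case shows a consequence of the chosen weights: a bare majority of raw votes without support from the rating-weighted vote gives 0.32, which stays below the 0.36 threshold.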

A 95% confidence interval for the mean diagnosis weight is also calculated: [0.0721, 0.1022]. The mean weight is low, which is consistent with the final diagnosis being 0 for most of the images.
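The interval can be cross-checked from the summary statistics that the notebook prints for Diagnosis_weights (mean, standard deviation, and 477 images). The sketch below uses a normal approximation from the standard library in place of the Student's t quantile used in the notebook; with 477 observations the two quantiles differ only slightly. The function name `mean_ci_from_summary` is an assumption.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_from_summary(sample_mean, population_std, n, level=0.95):
    """Approximate confidence interval for a mean from summary statistics.

    population_std is the ddof=0 standard deviation (the np.std default);
    it is converted to the ddof=1 standard error that scipy.stats.sem uses.
    """
    std_err = population_std / sqrt(n - 1)
    z = NormalDist().inv_cdf((1 + level) / 2)  # normal quantile, not Student's t
    return sample_mean - z * std_err, sample_mean + z * std_err

# Values printed by the notebook for Diagnosis_weights
lower, upper = mean_ci_from_summary(0.08712788259958072, 0.16725534527288208, 477)
print(lower, upper)
```

The result agrees with the reported interval to about three decimal places, which supports reading it as an interval for the mean weight rather than a range that contains most individual weights.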

In [ ]:
df = pd.concat([Diagnosis, confidence], axis=1)
df.columns = ['Diagnosis', 'Probability, %']
df['Probability, %'] = df.apply(lambda row: 100 - row['Probability, %'] if row['Diagnosis'] == 0 else row['Probability, %'], axis=1)
df
Out[0]:
Diagnosis Probability, %
0 0 92.0
1 0 100.0
2 0 96.0
3 0 100.0
4 0 88.0
... ... ...
472 0 100.0
473 0 84.0
474 0 100.0
475 0 100.0
476 0 96.0

477 rows × 2 columns

The dataframe above contains, for each image, the final diagnosis and the estimated probability (in percent) that this diagnosis is correct.

Let's evaluate the final quality of the markup.

In [ ]:
mean_probability = np.mean(Diagnosis_weights)
print(f"Средняя вероятность: {mean_probability}")

# Standard deviation of the probabilities
std_deviation = np.std(Diagnosis_weights)
print(f"Стандартное отклонение вероятностей: {std_deviation}")

# Maximum and minimum probability
max_probability = np.max(Diagnosis_weights)
min_probability = np.min(Diagnosis_weights)
print(f"Максимальная вероятность: {max_probability}")
print(f"Минимальная вероятность: {min_probability}")
Средняя вероятность: 0.08712788259958072
Стандартное отклонение вероятностей: 0.16725534527288208
Максимальная вероятность: 1.0
Минимальная вероятность: 0.0

As you can see, most of the diagnoses are negative, that is, there is no pathology.

In [ ]:
accuracy_d = accuracy_score(diagnosis_major, Diagnosis)
recall_d = recall_score(diagnosis_major, Diagnosis)
balanced_accuracy_d = balanced_accuracy_score(diagnosis_major, Diagnosis)
F1_d = f1_score(diagnosis_major, Diagnosis)
fbeta_d = fbeta_score(diagnosis_major, Diagnosis, beta=0.5)
roc_auc_d = roc_auc_score(diagnosis_major, Diagnosis)
print(f"Точность: {accuracy_d}")
print(f"Полнота: {recall_d}")
print(f"Сбалансированная Точность: {balanced_accuracy_d}")
print(f"F1: {F1_d}")
print(f"Fbeta: {fbeta_d}")
print(f"roc-auc: {roc_auc_d}")
Точность: 0.9622641509433962
Полнота: 0.6206896551724138
Сбалансированная Точность: 0.8025323275862069
F1: 0.6666666666666666
Fbeta: 0.6976744186046512
roc-auc: 0.8025323275862069

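The high accuracy (96.2%) indicates that most of the final labels match the majority reference.

Recall (62.1%) indicates that some positive cases are missed.

Balanced accuracy (80.3%) and ROC-AUC (80.3%) indicate a reasonably good ability to separate the classes. The two values coincide because ROC-AUC is computed here from hard 0/1 labels rather than from scores.

The F1-score (66.7%) and the F-beta score with beta = 0.5 (69.8%) indicate moderate quality of the positive diagnoses.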
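These metrics can all be derived from the four cells of the confusion matrix. The notebook does not print the cells, so the counts below are an assumption: they are back-calculated from the 477 images, the 25 positive final diagnoses, and the reported recall (which implies 29 positive reference labels). The sketch shows that these counts reproduce the reported values.

```python
# Assumed confusion-matrix counts, inferred from the reported figures
tp, fp, fn, tn = 18, 7, 11, 441

precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)
balanced_accuracy = (recall + specificity) / 2

def fbeta(p, r, beta):
    """F-beta score: weighted harmonic mean of precision p and recall r."""
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

f1 = fbeta(precision, recall, 1.0)
f_half = fbeta(precision, recall, 0.5)  # beta < 1 weights precision more heavily

print(precision, recall, accuracy, balanced_accuracy, f1, f_half)
```

Under these counts precision is 0.72, which is why the F-beta score with beta = 0.5 comes out higher than F1.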

In [ ]:
confusion_matrix_ = confusion_matrix(diagnosis_major, Diagnosis)
disp = ConfusionMatrixDisplay(confusion_matrix_)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix')
Out[0]:
Text(0.5, 1.0, 'Confusion Matrix')

The confusion matrix shows that there are few misdiagnoses.

In [ ]:
precision, recall, _ = precision_recall_curve(diagnosis_major, Diagnosis)
average_precision = average_precision_score(diagnosis_major, Diagnosis)

plt.figure()
plt.step(recall, precision, where='post', color='b', alpha=0.2, linestyle='-', linewidth=2, label='Precision-Recall curve')
plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve: AP={0:0.2f}'.format(average_precision))
plt.legend(loc="lower left")
Out[0]:
<matplotlib.legend.Legend at 0x7f8fa9817790>

As the graph above shows, precision (about 0.72) is somewhat higher than recall (about 0.62). Precision is also noticeably lower than accuracy, which indicates that some of the positive diagnoses are incorrect.

The markup quality assessment above shows that the markup is of fairly good quality, with small gaps. However, the reference labels (the targets), which were obtained by majority vote of the doctors, may not accurately reflect the actual state. Some doctors have a low rating, so even the majority opinion may be inaccurate. At the same time, the doctors do show some agreement, although it is low, which lends some support to the final markup.

Conclusions

In this example, the analysis of doctors' conclusions was carried out using data analytics methods.