Extended Probability & Statistics

Heart Disease Data Analysis

Data analysis is the process of extracting, presenting, and modeling information retrieved from raw sources. This example showcases a data analysis workflow in the Wolfram Language. The dataset comes from the UCI Machine Learning Repository and consists of heart disease diagnosis data from 1,541 patients.

Import heart disease diagnosis data and parse it so the rows correspond to different patients, and the columns correspond to different attributes.

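(* Download the raw text file *)
rawdata = Import[
   "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/new.data",
   "Text"];
(* Records are separated by runs of letters, so split there to get one string per patient *)
data = StringSplit[rawdata, LetterCharacter ..];
(* Split each record on spaces and newlines and convert the fields to expressions *)
data = Table[
   ToExpression[StringSplit[dat, (" " | "\n") ..]], {dat, data}];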

Extract the relevant attributes into "labels" and "features". The values stored in "labels" are 0 and 1, which correspond to the absence and presence of heart disease, respectively.

labels = Unitize[data[[All, 58]]];
features = data[[All, {3, 4, 9, 10, 12, 16, 19, 32, 38, 40, 41, 44, 51}]];

Take[labels, 10]
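
As a small illustration of how Unitize collapses an attribute into two classes, the sketch below applies it to a made-up list of values. The list `toy` is an assumption for illustration and is not taken from the dataset. Zero stays 0 and every nonzero value maps to 1.

```wolfram
(* Hypothetical diagnosis values, not taken from the dataset *)
toy = {0, 2, 1, 0, 4};

(* Unitize maps 0 to 0 and every nonzero value to 1 *)
Unitize[toy]
(* {0, 1, 1, 0, 1} *)
```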

For each patient, the feature vector is a list of numerical values. However, the data is not complete and has missing fields stored as -9.

Replace each missing value with the mean of the available data for the corresponding attribute, then visualize the correlation between different attributes.

features = Transpose[
   Table[
    N[attribute /. {-9 -> Mean[N[DeleteCases[attribute, -9]]]}],
    {attribute, Transpose[features]}]];
cormat = Correlation[features];

MatrixPlot[cormat, ImageSize -> Medium, PlotTheme -> "Detailed"]
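
The imputation step can be checked on a single column. The following is a minimal sketch using a made-up attribute column (`attribute`, `avg`, and `imputed` are illustrative names, and the values are not from the dataset); it applies the same replacement rule as the code above.

```wolfram
(* Hypothetical attribute column; -9 marks a missing value *)
attribute = {1, -9, 3, -9, 5};

(* Mean of the available values only *)
avg = Mean[N[DeleteCases[attribute, -9]]];

(* Substitute the mean for every -9 *)
imputed = N[attribute /. {-9 -> avg}]
(* {1., 3., 3., 3., 5.} *)
```

Because the mean is computed after the missing markers are dropped, the placeholder value -9 never biases the estimate.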

To visualize the distribution of the data, principal component analysis (PCA) is performed to extract the two leading components, and the projected data is then shown on a scatter plot.

pcs2 = Take[PrincipalComponents[features, Method -> "Correlation"], All, 2];

splot = With[{ind = Pick[Range[Length[pcs2]], labels, 0]},
  ListPlot[{pcs2[[ind]], Delete[pcs2, Transpose[{ind}]]},
   ImageSize -> Medium, PlotTheme -> "Detailed",
   PlotMarkers -> Automatic,
   PlotLegends -> {"Absence", "Presence"}]]
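
To make the shape of the projection concrete, here is a small sketch on made-up data (the matrix `toy` and the name `proj` are assumptions for illustration). It uses the same calls as above: PrincipalComponents returns one row per observation, and Take with All and 2 keeps every row but only the first two columns.

```wolfram
(* Hypothetical data: 5 observations of 3 attributes *)
toy = N[{{1, 2, 0}, {2, 1, 1}, {3, 5, 0}, {4, 3, 2}, {5, 6, 1}}];

(* Project onto the principal components, then keep the first two columns *)
proj = Take[PrincipalComponents[toy, Method -> "Correlation"], All, 2];

Dimensions[proj]
(* {5, 2} *)
```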

To distinguish the two classes, the projected data is fitted to a two-component Gaussian mixture model.

edist = EstimatedDistribution[pcs2,
   MixtureDistribution[{p1, p2},
    {BinormalDistribution[{m11, m12}, {s11, s12}, r1],
     BinormalDistribution[{m21, m22}, {s21, s22}, r2]}]];

Based on the mixture model, plot the decision boundary (black curve) and probability density contours (red curve) of the mixture model and show them together with the scatter plot. The first component of the Gaussian mixture has higher probability inside the decision boundary.

Module[{p = edist[[1, 1]], dist1 = edist[[2, 1]], dist2 = edist[[2, 2]], prob},
 prob[x_, y_] := p PDF[dist1, {x, y}]/
    (p PDF[dist1, {x, y}] + (1 - p) PDF[dist2, {x, y}]);
 Show[splot,
  ContourPlot[
   PDF[edist, {x, y}] == {0.1, 0.05, 0.0125, 0.003, 0.0001},
   {x, -4, 6}, {y, -4, 6},
   ImageSize -> Medium, PlotTheme -> "Detailed",
   ContourStyle -> {Dashed, Thick, Red},
   PlotLegends -> LineLegend[{Directive[Red, Dashed]}, {"Probability Density"}],
   PlotPoints -> 25],
  ContourPlot[prob[x, y] == 1/2,
   {x, -4, 6}, {y, -4, 6},
   ImageSize -> Medium, PlotTheme -> "Detailed",
   ContourStyle -> {Thickness[.01], Black},
   PlotLegends -> LineLegend[{Directive[Black, AbsoluteDashing[{1, 1}]]}, {"Decision Boundary"}],
   PlotPoints -> 25]]
 ]
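
The quantity `prob` in the code above can be read as the posterior probability that a point belongs to the first mixture component. As a sketch of that reasoning, write p for the weight of the first component and f_1, f_2 for the densities of the two binormal components (`dist1` and `dist2` in the code). The notation is introduced here for illustration, and the use of 1 - p for the second weight assumes, as the code does, that the two weights sum to 1.

```latex
% Mixture density (the red contours are its level sets)
f(x, y) = p\, f_1(x, y) + (1 - p)\, f_2(x, y)

% Posterior probability of the first component (prob[x, y] in the code)
P(1 \mid x, y) = \frac{p\, f_1(x, y)}{p\, f_1(x, y) + (1 - p)\, f_2(x, y)}

% Decision boundary: both components are equally likely
P(1 \mid x, y) = \tfrac{1}{2}
\quad\Longleftrightarrow\quad
p\, f_1(x, y) = (1 - p)\, f_2(x, y)
```

On one side of this curve the first component is the more probable explanation of a point, and on the other side the second component is.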
