
Breast cancer data: this data set comes from…

abdul asked 2 months ago
  1. Breast Cancer Data

This data set comes from: https://www.kaggle.com/uciml/breast-cancer-Wisconsin-data

After a biopsy of tumor tissue, tests are run on the tumor cells to determine a diagnosis of “benign” or “malignant”. The tests result in 30 different cell attribute measurement values. Some of the measured aspects are “radius mean”, “perimeter mean”, and “area mean”, which measure the mean value of cell radius (distance from center point), perimeter, and area. Looking at the web page and the data set, you can see the other 27 values. Based on these 30 values, a formula is applied to determine whether the tumor is malignant or benign.
The two following files show the data:

  • breastCancerDataReducedDimensions.cvs: Only the first 4 attributes (you can just read this file instead of the file containing the entire set)
  • breastCancerData.csv: The full data set

Note, the first column is the sample id. The second column is the diagnosis for the sample, where “M” means malignant and “B” means benign.
At lunch one day, you and a medical technician come up with the idea that all this data and the complicated formula are not needed. Instead, you decide you only need to look at the means of the first four metrics {radius, texture, perimeter, area}. The process is as follows:

  a) Strip the data so that only those 4 values are considered.
  b) Create four data files:
     q3_gte_13: third attribute, those data samples whose radius value is >= 13
     q4_gte_18: fourth attribute, those data samples whose texture value is >= 18
     q5_gte_85: fifth attribute, those data samples whose perimeter value is >= 85
     q6_gte_500: sixth attribute, those data samples whose area value is >= 500
  c) Find the data ids that are in each of these four files. The idea is that if a data sample meets or exceeds the threshold (13, 18, 85, and 500) for each of these 4 attributes, then the tumor is malignant. If the sample does not reach any of these thresholds, then the tumor is benign. If the sample reaches some, but not all, of these thresholds, then the tumor could be either benign or malignant.
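
Steps a) and b) can be sketched in Python as follows. This is only a minimal sketch: it assumes the CSV has a header row and the column order id, diagnosis, radius mean, texture mean, perimeter mean, area mean. The three sample rows are made up for illustration and are not taken from the real data set, and the helper name `ids_at_or_above` is hypothetical.

```python
import csv
import io

# Made-up rows for illustration only (NOT real data). Assumed column order:
# id, diagnosis, radius mean, texture mean, perimeter mean, area mean.
SAMPLE = """id,diagnosis,radius_mean,texture_mean,perimeter_mean,area_mean
101,M,17.9,20.1,120.0,1000.0
102,B,11.2,19.0,70.5,380.0
103,B,13.5,15.2,86.0,560.0
"""


def ids_at_or_above(rows, column, threshold):
    """Return {id: diagnosis} for rows whose value in `column` is >= threshold."""
    return {r[0]: r[1] for r in rows if float(r[column]) >= threshold}


reader = csv.reader(io.StringIO(SAMPLE))
next(reader)  # skip the header row (assumed to be present)
rows = [r[:6] for r in reader]  # step a): keep id, diagnosis and the 4 means

# Step b): one collection per attribute. Indexes are 0-based here,
# so the "third attribute" (radius) is index 2.
q3_gte_13 = ids_at_or_above(rows, 2, 13)
q4_gte_18 = ids_at_or_above(rows, 3, 18)
q5_gte_85 = ids_at_or_above(rows, 4, 85)
q6_gte_500 = ids_at_or_above(rows, 5, 500)
```

To work on the real file, the `io.StringIO(SAMPLE)` object would be replaced by an opened file handle; keeping the diagnosis next to each id makes the later B/M split straightforward.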

To do this, you need to take the intersection of the four files, each of which holds the ids of the samples that met its threshold.
The four files (q3_gte_13, q4_gte_18, q5_gte_85, q6_gte_500) also include the diagnosis of “B” or “M” from the original test. You want to test the quality of your process. To do this, create 2 versions of each of the 4 files. For column 3 these are q3_B and q3_M, where B means the sample was diagnosed as benign and M means it was diagnosed as malignant.
Likewise for columns 4, 5, and 6, giving the files: q4_B, q4_M, q5_B, q5_M, q6_B, q6_M.
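
One possible way to produce the B and M versions is sketched below. It assumes each threshold file has been loaded as a mapping from id to diagnosis; the ids are made up, and the helper name `split_by_diagnosis` is hypothetical.

```python
# Hypothetical contents of q3_gte_13 as {id: diagnosis}; the ids are made up.
q3_gte_13 = {"101": "M", "103": "B", "104": "M"}


def split_by_diagnosis(samples):
    """Split {id: diagnosis} into (benign_ids, malignant_ids)."""
    benign = {i for i, d in samples.items() if d == "B"}
    malignant = {i for i, d in samples.items() if d == "M"}
    return benign, malignant


q3_B, q3_M = split_by_diagnosis(q3_gte_13)
print(sorted(q3_B), sorted(q3_M))
```

The same call would be repeated for the files of columns 4, 5, and 6.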
With these files you can now test how well your new, easier process works.
Let file NewResult contain the intersection of ids from the four files (q3_gte_13, q4_gte_18, q5_gte_85, q6_gte_500), i.e. regardless of whether the original methods said M or B.
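
As a rough illustration, the intersection can be taken with Python's built-in set type. The id sets below are made up and only stand in for the ids read from the four threshold files.

```python
# Made-up id sets standing in for the ids in the four threshold files.
q3_gte_13 = {"101", "103", "104"}
q4_gte_18 = {"101", "102", "104"}
q5_gte_85 = {"101", "103", "104"}
q6_gte_500 = {"101", "104", "105"}

# An id belongs to NewResult only if it appears in all four sets.
new_result = q3_gte_13 & q4_gte_18 & q5_gte_85 & q6_gte_500
print(sorted(new_result))  # ['101', '104']
```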
There are two ways to test this new process.
Method 1: 
Compare NewResult to the four files of ids with an M diagnosis that you created: q3_M, q4_M, q5_M, and q6_M. Let SubsetMResult be the union of these four files. Then calculate:
Difference_1 = SubsetMResult – NewResult
If your new method captures all the same data, then Difference_1 should be the empty set.
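
Method 1 maps onto set union and set difference. The sketch below uses made-up id sets in place of the real files, so the printed ids say nothing about the actual data.

```python
# Made-up id sets standing in for q3_M..q6_M and NewResult.
q3_M = {"101", "104"}
q4_M = {"101", "106"}
q5_M = {"101", "104"}
q6_M = {"101", "104", "107"}
new_result = {"101", "104"}

# Union: every id diagnosed M that met at least one threshold.
subset_m_result = q3_M | q4_M | q5_M | q6_M

# Ids diagnosed M that met some threshold but are missing from NewResult.
difference_1 = subset_m_result - new_result
print(sorted(difference_1, key=int))  # ['106', '107']
```

Sorting with `key=int` gives numeric rather than alphabetical order, which assumes the ids are integers.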
Method 2:
From the original data set, breastCancerData.csv, find all the ids that are marked M. Call this set OriginalResult.
Difference_2 = OriginalResult – NewResult
Again, if your new method captures all the same data, then Difference_2 should be the empty set.
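
Method 2 could be sketched like this. It assumes the full CSV has a header row with columns named `id` and `diagnosis`; the rows and the contents of `new_result` are made up for illustration.

```python
import csv
import io

# Made-up rows (NOT real data); the header names are an assumption.
SAMPLE = """id,diagnosis,radius_mean
101,M,17.9
102,B,11.2
108,M,12.1
9,M,14.0
"""

# OriginalResult: every id marked M in the original data.
original_result = {
    row["id"]
    for row in csv.DictReader(io.StringIO(SAMPLE))
    if row["diagnosis"] == "M"
}

new_result = {"101"}  # placeholder for the intersection computed earlier

# Ids diagnosed M that the simplified method fails to flag, sorted numerically.
difference_2 = sorted(original_result - new_result, key=int)
print(difference_2)  # ['9', '108']
```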
To turn in:

  • Contents of difference_1 (in sorted order)
  • Contents of difference_2 (in sorted order)
  • A short (4 sentence) answer to Q1: What does the value of set difference_1 obtained from method 1 tell you about how well your new method works?
  • A short (4 sentence) answer to Q2: What does the value of set difference_2 obtained from method 2 tell you about how well your new method works?
  • A short (4 sentence) answer to Q3: Why are difference_1 and difference_2 not the same?
