
Ministry of Education of the Russian Federation

MOSCOW STATE INSTITUTE

OF ELECTRONICS AND MATHEMATICS (Technical University)

LABORATORY WORK REPORT

Course:

Methods and Tools of Data Analysis

Topic:

"The WEKA Data Analysis System"


Supervisors ______________ I. Ignatyev

signature, date

______________ A. Grunau

signature, date
Author ____________ A. Kudryavtseva

signature, date Group S-74



CONTENTS

INTRODUCTION

MAIN PART

TASK 1: Modify the source file

TASK 2: Classify the source data

Naive Bayes method

J4.8 method (a modification of C4.5)

ID3 method

1R method

SVM method

TASK 3: Build association rules

Apriori method

CONCLUSION

DATA SET


INTRODUCTION


This laboratory work covers data analysis in Weka. The system is written in Java and consists of a set of libraries of data-processing functions plus several graphical interfaces to those libraries. The main interface is Explorer. It gives access to almost every operation the system provides, and it is the one used in this work. Weka also offers other interfaces: Knowledge Flow, for working with large data sets (Explorer loads all the data into memory at once, which makes large data sets hard to handle), and Experimenter, for choosing the best analysis method experimentally.

The work studies classification methods and association rule mining. To make the differences between the methods clearer, all of them are applied in Weka to the same source data set, and the results are analyzed and compared.

MAIN PART

TASK 1: Modify the source table file and save it in .arff format.

The modification consists of adding metadata fields at the beginning of the file, each on its own line: the relation name (@relation name), the attribute descriptions (@attribute name type), and @data immediately before the data itself. The following data types are available: numeric (numeric, real, integer), nominal (given as an enumeration of the form {i1, ..., in}), string (string), and date (date [date format]).

As a result, the following information was added to the beginning of the file:
@relation income

@attribute age numeric

@attribute workclass {Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked}

@attribute fnlwgt numeric

@attribute education {Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool}

@attribute education-num numeric

@attribute marital-status {Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse}

@attribute occupation {Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces}

@attribute relationship {Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried}

@attribute race {White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black}

@attribute sex {Female, Male}

@attribute capital-gain numeric

@attribute capital-loss numeric

@attribute hours-per-week numeric

@attribute native-country {United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands}

@attribute income {>50K, <=50K}

@data
The relation is named income. It is followed by the list of attributes with their types and admissible values. For example, the attribute age has type numeric (age takes numeric values), while workclass is nominal (its admissible values are listed in braces). The @data line precedes the data itself.
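The header structure described above can be sketched in Python. This is only an illustration, not part of Weka: the function name arff_header and the shortened attribute list are assumptions made for the example.

```python
def arff_header(relation, attributes):
    """Build the metadata block of an ARFF file.

    `attributes` is a list of (name, type) pairs; a type is either a
    string such as "numeric" or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, attr_type in attributes:
        if isinstance(attr_type, (list, tuple)):
            # Nominal attribute: enumerate the admissible values in braces.
            attr_type = "{" + ", ".join(attr_type) + "}"
        lines.append(f"@attribute {name} {attr_type}")
    lines += ["", "@data"]
    return "\n".join(lines)


# Shortened, illustrative attribute list (the real file declares 15 attributes).
header = arff_header(
    "income",
    [
        ("age", "numeric"),
        ("sex", ["Female", "Male"]),
        ("income", [">50K", "<=50K"]),
    ],
)
print(header)
```

The data rows of the original table would then be appended unchanged after the @data line.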
TASK 2: Classify the source data with the naive Bayes method, the J4.8 method (a modification of C4.5), the ID3 method, the 1R method, and the SVM method (called SMO in Weka). If a method cannot be applied to the data, use filters. Describe the results.
Classification accuracy is estimated by cross-validation. Cross-validation is a procedure that estimates classification accuracy on data from a test set, also called the cross-validation set. The accuracy on the test set is compared with the accuracy on the training set. If the two are approximately equal, the model is considered to have passed cross-validation. In the simplest case the sample is split into training and test sets in a fixed proportion, for example two thirds for training and one third for testing. The runs below use 10-fold cross-validation (Test mode: 10-fold cross-validation): the data is divided into 10 parts, and each part in turn serves as the test set for a model trained on the other nine.
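A minimal Python sketch of how a stratified k-fold split can be organized is given here. It is a simplification under stated assumptions: Weka also shuffles the data before splitting, which is omitted, and the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each instance index to one of k folds, keeping the class
    proportions in every fold roughly equal to those of the whole set."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        # Deal the members of each class round-robin across the folds.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds


def cross_validation_splits(labels, k):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Toy labels with the class sizes of the income data set: 91 and 309.
labels = [">50K"] * 91 + ["<=50K"] * 309
splits = list(cross_validation_splits(labels, 10))
sizes = [len(test) for _, test in splits]
```

Every instance is used for testing exactly once, so the accuracy averaged over the rounds is based on all 400 instances.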
Naive Bayes method (NaiveBayes)

"Naive" classification is a fairly transparent and easy-to-understand classification method. It is called "naive" because of its naive assumption that all the variables under consideration are independent of one another.

Properties of naive classification:

1. It uses all the variables and determines all the dependencies between them.

2. It makes two assumptions about the variables:


  • all variables are equally important;

  • all variables are statistically independent, i.e. the value of one variable says nothing about the value of another.

The idea of the algorithm is to compute the conditional probability that an object belongs to a class given that its independent variables take particular values.

One advantage of the method is that missing values cause no problems. When the probabilities are computed, they are simply skipped for all rules, which does not affect the ratio of the probabilities.
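The calculation can be illustrated with a short Python sketch. It is a simplification, not Weka's exact procedure: numeric attributes are scored with the plain normal density, only two attributes (age and sex) are used, and the function names are hypothetical. The parameters are copied from the Weka model printed below.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def naive_bayes_posterior(priors, likelihoods):
    """Combine class priors with per-attribute likelihoods.

    `likelihoods[c]` is a list of P(attribute value | class c); the
    independence assumption lets us simply multiply them.
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for value in likelihoods[cls]:
            score *= value
        scores[cls] = score
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}


# Parameters copied from the Weka model output.
PRIORS = {">50K": 0.23, "<=50K": 0.77}
AGE = {">50K": (45.7806, 11.4273), "<=50K": (37.3432, 14.9805)}
# Count for sex = Male and the total count, per class.
MALE = {">50K": (76, 93), "<=50K": (198, 311)}


def posterior_for(age):
    """Posterior class probabilities for a man of the given age."""
    likelihoods = {
        cls: [gaussian_pdf(age, *AGE[cls]), MALE[cls][0] / MALE[cls][1]]
        for cls in PRIORS
    }
    return naive_bayes_posterior(PRIORS, likelihoods)


print(posterior_for(50))
print(posterior_for(25))
```

A missing attribute value would simply contribute no factor to the product for either class, which is why it leaves the ratio of the class probabilities unchanged.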


=== Run information ===
Scheme: weka.classifiers.bayes.NaiveBayes

Relation: income

Instances: 400

Attributes: 15

age

workclass



fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex


capital-gain

capital-loss

hours-per-week

native-country

income

Test mode: 10-fold cross-validation


=== Classifier model (full training set) ===
Naive Bayes Classifier
Class >50K: Prior probability = 0.23
age: Normal Distribution. Mean = 45.7806 StandardDev = 11.4273 WeightSum = 91 Precision = 1.2586206896551724

workclass: Discrete Estimator. Counts = 53 9 10 11 7 4 (Total = 94)

fnlwgt: Normal Distribution. Mean = 191613.3811 StandardDev = 104020.9072 WeightSum = 91 Precision = 1368.0227272727273

education: Discrete Estimator. Counts = 24 17 3 21 10 4 6 2 1 2 7 1 4 2 2 1 (Total = 107)

education-num: Normal Distribution. Mean = 10.938 StandardDev = 2.6556 WeightSum = 91 Precision = 1.0714285714285714

marital-status: Discrete Estimator. Counts = 79 7 5 3 2 1 (Total = 97)

occupation: Discrete Estimator. Counts = 7 15 5 11 18 23 1 2 7 3 5 1 3 1 (Total = 102)

relationship: Discrete Estimator. Counts = 12 4 67 9 1 4 (Total = 97)

race: Discrete Estimator. Counts = 79 8 1 3 5 (Total = 96)

sex: Discrete Estimator. Counts = 17 76 (Total = 93)

capital-gain: Normal Distribution. Mean = 5626.3174 StandardDev = 18103.0931 WeightSum = 91 Precision = 3999.96

capital-loss: Normal Distribution. Mean = 208.3642 StandardDev = 636.7895 WeightSum = 91 Precision = 201.71428571428572

hours-per-week: Normal Distribution. Mean = 45.1846 StandardDev = 11.1145 WeightSum = 91 Precision = 2.4358974358974357

native-country: Discrete Estimator. Counts = 77 2 2 1 2 1 4 1 2 1 2 1 1 1 2 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 (Total = 127)

Class <=50K: Prior probability = 0.77
age: Normal Distribution. Mean = 37.3432 StandardDev = 14.9805 WeightSum = 309 Precision = 1.2586206896551724

workclass: Discrete Estimator. Counts = 216 25 10 14 28 5 (Total = 298)

fnlwgt: Normal Distribution. Mean = 180025.5927 StandardDev = 94860.5933 WeightSum = 309 Precision = 1368.0227272727273

education: Discrete Estimator. Counts = 39 76 14 116 2 8 10 12 5 7 12 1 14 3 4 2 (Total = 325)

education-num: Normal Distribution. Mean = 9.3516 StandardDev = 2.2685 WeightSum = 309 Precision = 1.0714285714285714

marital-status: Discrete Estimator. Counts = 99 42 145 11 14 4 (Total = 315)

occupation: Discrete Estimator. Counts = 8 42 45 32 29 33 11 27 34 18 13 5 6 3 (Total = 306)

relationship: Discrete Estimator. Counts = 5 65 90 97 14 44 (Total = 315)

race: Discrete Estimator. Counts = 253 7 10 8 36 (Total = 314)

sex: Discrete Estimator. Counts = 113 198 (Total = 311)

capital-gain: Normal Distribution. Mean = 168.2831 StandardDev = 865.0805 WeightSum = 309 Precision = 3999.96

capital-loss: Normal Distribution. Mean = 45.043 StandardDev = 268.035 WeightSum = 309 Precision = 201.71428571428572

hours-per-week: Normal Distribution. Mean = 38.6196 StandardDev = 12.1111 WeightSum = 309 Precision = 2.4358974358974357

native-country: Discrete Estimator. Counts = 280 1 3 2 2 2 2 1 2 1 1 1 1 3 1 1 1 1 10 1 1 1 4 1 1 1 1 2 1 2 2 1 1 1 2 1 1 1 1 (Total = 343)

Time taken to build model: 0.41 seconds
=== Stratified cross-validation ===

=== Summary ===


Correctly Classified Instances 333 83.25 %

Incorrectly Classified Instances 67 16.75 %

Kappa statistic 0.4677

Mean absolute error 0.1739

Root mean squared error 0.3755

Relative absolute error 49.3473 %

Root relative squared error 89.5762 %

Total Number of Instances 400


=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class

0.484 0.065 0.688 0.484 0.568 >50K

0.935 0.516 0.86 0.935 0.896 <=50K
=== Confusion Matrix ===
a b <-- classified as

44 47 | a = >50K

20 289 | b = <=50K
The Classifier model section lets us draw conclusions about people whose income is a = >50K (prior probability 0.23) or b = <=50K (prior probability 0.77). For example, the mean age of those who earn >50K is 45.7806 with a standard deviation of 11.4273; there are 91 such people, and the precision is 1.2586206896551724 (age: Normal Distribution. Mean = 45.7806 StandardDev = 11.4273 WeightSum = 91 Precision = 1.2586206896551724). For those who earn <=50K the mean age is 37.3432 (age: Normal Distribution. Mean = 37.3432 StandardDev = 14.9805 WeightSum = 309 Precision = 1.2586206896551724). Thus, for numeric attributes the output shows the per-class mean, standard deviation, weight sum (the number of instances) and precision. For nominal attributes it shows how often each value occurs in the class. For example, for income >50K the count for workclass Private is 53, for Self-emp-not-inc it is 9, and so on, 94 in total (workclass: Discrete Estimator. Counts = 53 9 10 11 7 4 (Total = 94)). These counts include a Laplace correction of one per value, so the raw frequencies are one lower: for sex, for instance, the total is 93 = 91 + 2.

Cross-validation gives a fairly high share of correctly classified instances (83.25 %); the mean absolute error is 0.1739 (Mean absolute error 0.1739).
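The summary figures follow directly from the confusion matrix. The Python sketch below recomputes them from the NaiveBayes matrix above. The function name summarize is an assumption made for the example; the formulas are the standard definitions of accuracy, Cohen's kappa, recall (TP rate), precision and F-measure.

```python
def summarize(matrix):
    """Derive summary metrics from a square confusion matrix.

    matrix[i][j] is the number of instances of actual class i that
    were classified as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    accuracy = correct / total

    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    per_class = []
    for i in range(len(matrix)):
        recall = matrix[i][i] / row_totals[i]
        precision = matrix[i][i] / col_totals[i]
        f_measure = 2 * precision * recall / (precision + recall)
        per_class.append((recall, precision, f_measure))
    return accuracy, kappa, per_class


# Confusion matrix reported by Weka for NaiveBayes (rows: >50K, <=50K).
accuracy, kappa, per_class = summarize([[44, 47], [20, 289]])
```

The low recall for the >50K class shows that most of the errors are high earners classified as <=50K, something the overall accuracy alone does not reveal.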


J4.8 method (a modification of C4.5)

The algorithm recursively partitions the set of training objects into subsets whose objects belong to the same class.

It is an improved version of the ID3 algorithm. The improvements include the following:


  • It can work with numeric attributes as well as categorical ones. To do so, the algorithm splits the range of an independent variable into several intervals and divides the original set into subsets according to the interval into which the value of that variable falls.

  • After the tree is built, its branches are pruned. If the resulting tree is too large, either several nodes are grouped into a single leaf or a node is replaced by the subtree below it. Before such an operation, the error of the classification rule held in the node under consideration is computed. If the error does not increase after the replacement (or grouping), and the entropy does not grow much, the change can be made without harming the model.
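How a numeric attribute can be turned into a split is sketched below in Python. This is a simplified illustration, not the J48 implementation: it searches for a single binary threshold by plain information gain, whereas C4.5 ranks splits by gain ratio and applies further corrections. The data values and function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def best_threshold(values, labels):
    """Find the binary split `value <= t` with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        # Candidate threshold: the lower of the two neighbouring values.
        t = pairs[i - 1][0]
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        remainder = (
            len(left) * entropy(left) + len(right) * entropy(right)
        ) / len(pairs)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain


# Hypothetical toy data: capital-gain values and income classes.
gains = [0, 0, 1500, 3000, 5000, 7500, 15000]
classes = ["<=50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"]
threshold, gain = best_threshold(gains, classes)
```

On this toy data the chosen threshold separates the two classes completely, so the gain equals the entropy of the whole set.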

=== Run information ===


Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2

Relation: income

Instances: 400

Attributes: 15

age

workclass



fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex


capital-gain

capital-loss

hours-per-week

native-country

income

Test mode: 10-fold cross-validation


=== Classifier model (full training set) ===
J48 pruned tree

------------------


capital-gain <= 4064

| capital-loss <= 1721: <=50K (363.0/58.0)

| capital-loss > 1721: >50K (11.0/2.0)

capital-gain > 4064: >50K (26.0/2.0)


Number of Leaves : 3
Size of the tree : 5

Time taken to build model: 0.09 seconds


=== Stratified cross-validation ===

=== Summary ===


Correctly Classified Instances 330 82.5 %

Incorrectly Classified Instances 70 17.5 %

Kappa statistic 0.3378

Mean absolute error 0.2845

Root mean squared error 0.3803

Relative absolute error 80.7521 %

Root relative squared error 90.7109 %

Total Number of Instances 400


=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class

0.264 0.01 0.889 0.264 0.407 >50K

0.99 0.736 0.82 0.99 0.897 <=50K
=== Confusion Matrix ===
a b <-- classified as

24 67 | a = >50K

3 306 | b = <=50K
The tree built by this algorithm is far more compact than the one produced by the ID3 method below: it has 3 leaves and 5 nodes in total. The share of correctly classified instances remains fairly high (82.5 %).
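Because the pruned tree is so small, it can be written out as a pair of nested conditions. The Python sketch below restates the tree from the J48 output and sums the leaf statistics, reading each pair in parentheses as (instances reaching the leaf / of which misclassified). The function name classify_income is an assumption made for the example.

```python
def classify_income(capital_gain, capital_loss):
    """Apply the pruned J48 tree from the output above."""
    if capital_gain <= 4064:
        if capital_loss <= 1721:
            return "<=50K"
        return ">50K"
    return ">50K"


# Leaves as (instances reaching the leaf, of which misclassified).
LEAVES = [(363.0, 58.0), (11.0, 2.0), (26.0, 2.0)]
covered = sum(n for n, _ in LEAVES)
errors = sum(e for _, e in LEAVES)
training_accuracy = (covered - errors) / covered
```

On the training data the tree misclassifies 62 of the 400 instances (84.5 % correct), slightly better than the 82.5 % obtained in cross-validation.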
ID3 method

This algorithm requires two filters: RemoveType, to remove the numeric attributes, and ReplaceMissingValues, to fill in the missing values (Weka replaces them with the mode of a nominal attribute and the mean of a numeric one).
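The effect of the two filters can be imitated with a few lines of Python. This is a sketch of the idea only, not the Weka filters themselves: the function names and sample rows are hypothetical, and only the nominal case (replacement by the mode) is shown.

```python
from collections import Counter

MISSING = "?"  # ARFF marks a missing value with a question mark


def remove_numeric(rows, types):
    """Drop every column whose declared type is "numeric"."""
    keep = [i for i, t in enumerate(types) if t != "numeric"]
    return [[row[i] for i in keep] for row in rows], [types[i] for i in keep]


def replace_missing_with_mode(rows):
    """Fill missing nominal values with the most frequent value of the column."""
    columns = list(zip(*rows))
    modes = []
    for column in columns:
        present = [v for v in column if v != MISSING]
        modes.append(Counter(present).most_common(1)[0][0])
    return [
        [modes[i] if v == MISSING else v for i, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical rows: age, workclass, income.
types = ["numeric", "nominal", "nominal"]
rows = [
    ["39", "Private", "<=50K"],
    ["50", "?", "<=50K"],
    ["38", "Private", ">50K"],
    ["53", "State-gov", "<=50K"],
]
nominal_rows, nominal_types = remove_numeric(rows, types)
clean_rows = replace_missing_with_mode(nominal_rows)
```

After both steps only nominal attributes without gaps remain, which matches the 9 attributes listed in the run information below.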


=== Run information ===
Scheme: weka.classifiers.trees.Id3

Relation: income-weka.filters.unsupervised.attribute.RemoveType-Tnumeric-weka.filters.unsupervised.attribute.ReplaceMissingValues

Instances: 400

Attributes: 9

workclass

education

marital-status

occupation

relationship

race


sex

native-country

income

Test mode: 10-fold cross-validation


=== Classifier model (full training set) ===
Id3

relationship = Wife

| occupation = Tech-support

| | workclass = Private: null

| | workclass = Self-emp-not-inc: null

| | workclass = Self-emp-inc: null

| | workclass = Federal-gov: <=50K

| | workclass = Local-gov: >50K

| | workclass = State-gov: null

| occupation = Craft-repair: >50K

| occupation = Other-service: >50K

| occupation = Sales: null

| occupation = Exec-managerial: >50K

| occupation = Prof-specialty

| | workclass = Private: null

| | workclass = Self-emp-not-inc: <=50K

| | workclass = Self-emp-inc: >50K

| | workclass = Federal-gov: null

| | workclass = Local-gov: >50K

| | workclass = State-gov: null

| occupation = Handlers-cleaners: null

| occupation = Machine-op-inspct: null

| occupation = Adm-clerical: >50K

| occupation = Farming-fishing: null

| occupation = Transport-moving: <=50K

| occupation = Priv-house-serv: null

| occupation = Protective-serv: null

| occupation = Armed-Forces: null

relationship = Own-child

| marital-status = Married-civ-spouse

| | education = Bachelors: null

| | education = Some-college: <=50K

| | education = 11th: null

| | education = HS-grad: null

| | education = Prof-school: null

| | education = Assoc-acdm: null

| | education = Assoc-voc: null

| | education = 9th: null

| | education = 7th-8th: null

| | education = 12th: null

| | education = Masters: null

| | education = 1st-4th: null

| | education = 10th: >50K

| | education = Doctorate: null

| | education = 5th-6th: null

| | education = Preschool: null

| marital-status = Divorced: >50K

| marital-status = Never-married

| | education = Bachelors: <=50K

| | education = Some-college: <=50K

| | education = 11th: <=50K

| | education = HS-grad: <=50K

| | education = Prof-school: null

| | education = Assoc-acdm: null

| | education = Assoc-voc

| | | workclass = Private: >50K

| | | workclass = Self-emp-not-inc: <=50K

| | | workclass = Self-emp-inc: null

| | | workclass = Federal-gov: null

| | | workclass = Local-gov: null

| | | workclass = State-gov: null

| | education = 9th: <=50K

| | education = 7th-8th: <=50K

| | education = 12th: <=50K

| | education = Masters: null

| | education = 1st-4th: null

| | education = 10th: <=50K

| | education = Doctorate: null

| | education = 5th-6th: null

| | education = Preschool: null

| marital-status = Separated: <=50K

| marital-status = Widowed: null

| marital-status = Married-spouse-absent: null

relationship = Husband

| education = Bachelors

| | occupation = Tech-support: >50K

| | occupation = Craft-repair: >50K

| | occupation = Other-service: null

| | occupation = Sales

| | | workclass = Private: >50K

| | | workclass = Self-emp-not-inc: >50K

| | | workclass = Self-emp-inc: >50K

| | | workclass = Federal-gov: null

| | | workclass = Local-gov: null

| | | workclass = State-gov: null

| | occupation = Exec-managerial

| | | workclass = Private

| | | | race = White: >50K

| | | | race = Asian-Pac-Islander: <=50K

| | | | race = Amer-Indian-Eskimo: null

| | | | race = Other: null

| | | | race = Black: null

| | | workclass = Self-emp-not-inc: >50K

| | | workclass = Self-emp-inc: null

| | | workclass = Federal-gov: <=50K

| | | workclass = Local-gov: null

| | | workclass = State-gov: >50K

| | occupation = Prof-specialty

| | | race = White

| | | | native-country = United-States: >50K

| | | | native-country = England: null

| | | | native-country = Puerto-Rico: null

| | | | native-country = Canada: null

| | | | native-country = Germany: null

| | | | native-country = India: null

| | | | native-country = Japan: null

| | | | native-country = Greece: null

| | | | native-country = South: null

| | | | native-country = China: null

| | | | native-country = Cuba: null

| | | | native-country = Iran: null

| | | | native-country = Honduras: null

| | | | native-country = Philippines: null

| | | | native-country = Italy: >50K

| | | | native-country = Poland: null

| | | | native-country = Jamaica: null

| | | | native-country = Vietnam: null

| | | | native-country = Mexico: null

| | | | native-country = Portugal: null

| | | | native-country = Ireland: >50K

| | | | native-country = France: null

| | | | native-country = Dominican-Republic: null

| | | | native-country = Laos: null

| | | | native-country = Ecuador: null

| | | | native-country = Taiwan: null

| | | | native-country = Haiti: null

| | | | native-country = Columbia: null

| | | | native-country = Hungary: null

| | | | native-country = Guatemala: null

| | | | native-country = Nicaragua: null

| | | | native-country = Scotland: null

| | | | native-country = Thailand: null

| | | | native-country = Yugoslavia: null

| | | | native-country = El-Salvador: null

| | | | native-country = Trinadad&Tobago: null

| | | | native-country = Peru: null

| | | | native-country = Hong: null

| | | | native-country = Holand-Netherlands: null

| | | race = Asian-Pac-Islander: >50K

| | | race = Amer-Indian-Eskimo: null

| | | race = Other: null

| | | race = Black: <=50K

| | occupation = Handlers-cleaners: null

| | occupation = Machine-op-inspct: null

| | occupation = Adm-clerical: >50K

| | occupation = Farming-fishing: <=50K

| | occupation = Transport-moving: null

| | occupation = Priv-house-serv: null

| | occupation = Protective-serv: null

| | occupation = Armed-Forces: null

| education = Some-college

| | occupation = Tech-support: >50K

| | occupation = Craft-repair

| | | workclass = Private

| | | | race = White: >50K

| | | | race = Asian-Pac-Islander: null

| | | | race = Amer-Indian-Eskimo: <=50K

| | | | race = Other: null

| | | | race = Black: null

| | | workclass = Self-emp-not-inc: null

| | | workclass = Self-emp-inc: <=50K

| | | workclass = Federal-gov: null

| | | workclass = Local-gov: >50K

| | | workclass = State-gov: null

| | occupation = Other-service: <=50K

| | occupation = Sales

| | | workclass = Private: >50K

| | | workclass = Self-emp-not-inc: <=50K

| | | workclass = Self-emp-inc: null

| | | workclass = Federal-gov: null

| | | workclass = Local-gov: null

| | | workclass = State-gov: null

| | occupation = Exec-managerial

| | | workclass = Private: <=50K

| | | workclass = Self-emp-not-inc: null

| | | workclass = Self-emp-inc: >50K

| | | workclass = Federal-gov: >50K

| | | workclass = Local-gov: <=50K

| | | workclass = State-gov: null

| | occupation = Prof-specialty

| | | workclass = Private: <=50K

| | | workclass = Self-emp-not-inc: null

| | | workclass = Self-emp-inc: null

| | | workclass = Federal-gov: null

| | | workclass = Local-gov: >50K

| | | workclass = State-gov: null

| | occupation = Handlers-cleaners: null

| | occupation = Machine-op-inspct: <=50K

| | occupation = Adm-clerical: >50K

| | occupation = Farming-fishing: <=50K

| | occupation = Transport-moving

| | | workclass = Private: >50K

| | | workclass = Self-emp-not-inc: <=50K

| | | workclass = Self-emp-inc: null

| | | workclass = Federal-gov: null

| | | workclass = Local-gov: null

| | | workclass = State-gov: null

| | occupation = Priv-house-serv: null

| | occupation = Protective-serv: >50K

| | occupation = Armed-Forces: null

| education = 11th

| | occupation = Tech-support: null

| | occupation = Craft-repair: <=50K

| | occupation = Other-service: <=50K

| | occupation = Sales: null

| | occupation = Exec-managerial: >50K

| | occupation = Prof-specialty: null

| | occupation = Handlers-cleaners: null

| | occupation = Machine-op-inspct: null

| | occupation = Adm-clerical: null

| | occupation = Farming-fishing: null

| | occupation = Transport-moving: null

| | occupation = Priv-house-serv: null

| | occupation = Protective-serv: null

| | occupation = Armed-Forces: null

| education = HS-grad

| | native-country = United-States

| | | workclass = Private

| | | | occupation = Tech-support: <=50K

| | | | occupation = Craft-repair

| | | | | race = White: <=50K

| | | | | race = Asian-Pac-Islander: null

| | | | | race = Amer-Indian-Eskimo: <=50K

| | | | | race = Other: null

| | | | | race = Black: null

| | | | occupation = Other-service

| | | | | race = White: <=50K

| | | | | race = Asian-Pac-Islander: null

| | | | | race = Amer-Indian-Eskimo: null

| | | | | race = Other: null

| | | | | race = Black: >50K

| | | | occupation = Sales: <=50K

| | | | occupation = Exec-managerial: null

| | | | occupation = Prof-specialty: null

| | | | occupation = Handlers-cleaners: <=50K

| | | | occupation = Machine-op-inspct: <=50K

| | | | occupation = Adm-clerical: <=50K

| | | | occupation = Farming-fishing: null

| | | | occupation = Transport-moving: <=50K

| | | | occupation = Priv-house-serv: null

| | | | occupation = Protective-serv: null

| | | | occupation = Armed-Forces: null

| | | workclass = Self-emp-not-inc

| | | | occupation = Tech-support: null

| | | | occupation = Craft-repair: <=50K

| | | | occupation = Other-service: null

| | | | occupation = Sales: <=50K

| | | | occupation = Exec-managerial: <=50K

| | | | occupation = Prof-specialty: null

| | | | occupation = Handlers-cleaners: null

| | | | occupation = Machine-op-inspct: null

| | | | occupation = Adm-clerical: null

| | | | occupation = Farming-fishing: >50K

| | | | occupation = Transport-moving: >50K

| | | | occupation = Priv-house-serv: null

| | | | occupation = Protective-serv: null

| | | | occupation = Armed-Forces: null

| | | workclass = Self-emp-inc

| | | | occupation = Tech-support: null

| | | | occupation = Craft-repair: null

| | | | occupation = Other-service: null

| | | | occupation = Sales: >50K

| | | | occupation = Exec-managerial: null

| | | | occupation = Prof-specialty: null

| | | | occupation = Handlers-cleaners: null

| | | | occupation = Machine-op-inspct: null

| | | | occupation = Adm-clerical: null

| | | | occupation = Farming-fishing: <=50K

| | | | occupation = Transport-moving: null

| | | | occupation = Priv-house-serv: null

| | | | occupation = Protective-serv: null

| | | | occupation = Armed-Forces: null

| | | workclass = Federal-gov: >50K

| | | workclass = Local-gov

| | | | occupation = Tech-support: null

| | | | occupation = Craft-repair: >50K

| | | | occupation = Other-service: <=50K

| | | | occupation = Sales: null

| | | | occupation = Exec-managerial: <=50K

| | | | occupation = Prof-specialty: null

| | | | occupation = Handlers-cleaners: <=50K

| | | | occupation = Machine-op-inspct: null

| | | | occupation = Adm-clerical: null

| | | | occupation = Farming-fishing: <=50K

| | | | occupation = Transport-moving: <=50K

| | | | occupation = Priv-house-serv: null

| | | | occupation = Protective-serv: null

| | | | occupation = Armed-Forces: null

| | | workclass = State-gov: null

| | native-country = England: null

| | native-country = Puerto-Rico: null

| | native-country = Canada: null

| | native-country = Germany: null

| | native-country = India: null

| | native-country = Japan: >50K

| | native-country = Greece: null

| | native-country = South: null

| | native-country = China: null

| | native-country = Cuba: >50K

| | native-country = Iran: null

| | native-country = Honduras: null

| | native-country = Philippines: <=50K

| | native-country = Italy: null

| | native-country = Poland: null

| | native-country = Jamaica: null

| | native-country = Vietnam: null

| | native-country = Mexico: <=50K

| | native-country = Portugal: null

| | native-country = Ireland: null

| | native-country = France: null

| | native-country = Dominican-Republic: <=50K

| | native-country = Laos: null

| | native-country = Ecuador: null

| | native-country = Taiwan: null

| | native-country = Haiti: null

| | native-country = Columbia: null

| | native-country = Hungary: null

| | native-country = Guatemala: null

| | native-country = Nicaragua: null

| | native-country = Scotland: null

| | native-country = Thailand: null

| | native-country = Yugoslavia: null

| | native-country = El-Salvador: null

| | native-country = Trinadad&Tobago: null

| | native-country = Peru: null

| | native-country = Hong: null

| | native-country = Holand-Netherlands: null

| education = Prof-school

| | workclass = Private: >50K

| | workclass = Self-emp-not-inc: >50K

| | workclass = Self-emp-inc: >50K

| | workclass = Federal-gov: >50K

| | workclass = Local-gov: <=50K

| | workclass = State-gov: null

| education = Assoc-acdm

| | occupation = Tech-support: null

| | occupation = Craft-repair: null

| | occupation = Other-service: >50K

| | occupation = Sales: null

| | occupation = Exec-managerial: >50K

| | occupation = Prof-specialty: >50K

| | occupation = Handlers-cleaners: null

| | occupation = Machine-op-inspct: null

| | occupation = Adm-clerical: null

| | occupation = Farming-fishing: null

| | occupation = Transport-moving: null

| | occupation = Priv-house-serv: null

| | occupation = Protective-serv: null

| | occupation = Armed-Forces: null

| education = Assoc-voc

| | workclass = Private

| | | occupation = Tech-support: <=50K

| | | occupation = Craft-repair: null

| | | occupation = Other-service: null

| | | occupation = Sales: <=50K

| | | occupation = Exec-managerial: null

| | | occupation = Prof-specialty: >50K

| | | occupation = Handlers-cleaners: null

| | | occupation = Machine-op-inspct: null

| | | occupation = Adm-clerical: null

| | | occupation = Farming-fishing: null

| | | occupation = Transport-moving: null

| | | occupation = Priv-house-serv: null

| | | occupation = Protective-serv: null

| | | occupation = Armed-Forces: null

| | workclass = Self-emp-not-inc: null

| | workclass = Self-emp-inc: <=50K

| | workclass = Federal-gov: >50K

| | workclass = Local-gov: null

| | workclass = State-gov: null

| education = 9th

| | occupation = Tech-support: null

| | occupation = Craft-repair: null

| | occupation = Other-service: null

| | occupation = Sales: null

| | occupation = Exec-managerial: <=50K

| | occupation = Prof-specialty: null

| | occupation = Handlers-cleaners: null

| | occupation = Machine-op-inspct: >50K

| | occupation = Adm-clerical: null

| | occupation = Farming-fishing: null

| | occupation = Transport-moving: <=50K

| | occupation = Priv-house-serv: null

| | occupation = Protective-serv: null

| | occupation = Armed-Forces: null

| education = 7th-8th: <=50K

| education = 12th

| | workclass = Private: >50K

| | workclass = Self-emp-not-inc: <=50K

| | workclass = Self-emp-inc: <=50K

| | workclass = Federal-gov: null

| | workclass = Local-gov: null

| | workclass = State-gov: null

| education = Masters

| | workclass = Private

| | | occupation = Tech-support: null

| | | occupation = Craft-repair: <=50K

| | | occupation = Other-service: null

| | | occupation = Sales: null

| | | occupation = Exec-managerial: >50K

| | | occupation = Prof-specialty: <=50K

| | | occupation = Handlers-cleaners: null

| | | occupation = Machine-op-inspct: null

| | | occupation = Adm-clerical: null

| | | occupation = Farming-fishing: null

| | | occupation = Transport-moving: null

| | | occupation = Priv-house-serv: null

| | | occupation = Protective-serv: null

| | | occupation = Armed-Forces: null

| | workclass = Self-emp-not-inc

| | | occupation = Tech-support: null

| | | occupation = Craft-repair: null

| | | occupation = Other-service: null

| | | occupation = Sales: null

| | | occupation = Exec-managerial: >50K

| | | occupation = Prof-specialty: null

| | | occupation = Handlers-cleaners: null

| | | occupation = Machine-op-inspct: null

| | | occupation = Adm-clerical: null

| | | occupation = Farming-fishing: <=50K

| | | occupation = Transport-moving: null

| | | occupation = Priv-house-serv: null

| | | occupation = Protective-serv: null

| | | occupation = Armed-Forces: null

| | workclass = Self-emp-inc: null

| | workclass = Federal-gov: >50K

| | workclass = Local-gov: <=50K

| | workclass = State-gov: null

| education = 1st-4th: null

| education = 10th

| | occupation = Tech-support: null

| | occupation = Craft-repair: <=50K

| | occupation = Other-service: null

| | occupation = Sales: null

| | occupation = Exec-managerial: null

| | occupation = Prof-specialty: null

| | occupation = Handlers-cleaners: null

| | occupation = Machine-op-inspct: <=50K

| | occupation = Adm-clerical: null

| | occupation = Farming-fishing: null

| | occupation = Transport-moving

| | | workclass = Private: >50K

| | | workclass = Self-emp-not-inc: <=50K

| | | workclass = Self-emp-inc: null

| | | workclass = Federal-gov: null

| | | workclass = Local-gov: null

| | | workclass = State-gov: null

| | occupation = Priv-house-serv: null

| | occupation = Protective-serv: null

| | occupation = Armed-Forces: null

| education = Doctorate

| | workclass = Private: >50K

| | workclass = Self-emp-not-inc: <=50K

| | workclass = Self-emp-inc: null

| | workclass = Federal-gov: null

| | workclass = Local-gov: null

| | workclass = State-gov: null

| education = 5th-6th

| | workclass = Private: >50K

| | workclass = Self-emp-not-inc: null

| | workclass = Self-emp-inc: null

| | workclass = Federal-gov: null

| | workclass = Local-gov: <=50K

| | workclass = State-gov: null

| education = Preschool: null

relationship = Not-in-family

| occupation = Tech-support: <=50K

| occupation = Craft-repair

| | workclass = Private

| | | marital-status = Married-civ-spouse: null

| | | marital-status = Divorced: <=50K

| | | marital-status = Never-married

| | | | education = Bachelors: <=50K

| | | | education = Some-college: <=50K

| | | | education = 11th: null

| | | | education = HS-grad: <=50K

| | | | education = Prof-school: null

| | | | education = Assoc-acdm: null

| | | | education = Assoc-voc: null

| | | | education = 9th: null

| | | | education = 7th-8th: null

| | | | education = 12th: null

| | | | education = Masters: null

| | | | education = 1st-4th: null

| | | | education = 10th: null

| | | | education = Doctorate: null

| | | | education = 5th-6th: null

| | | | education = Preschool: null

| | | marital-status = Separated: <=50K

| | | marital-status = Widowed: null

| | | marital-status = Married-spouse-absent: <=50K

| | workclass = Self-emp-not-inc: <=50K

| | workclass = Self-emp-inc: null

| | workclass = Federal-gov: null

| | workclass = Local-gov: >50K

| | workclass = State-gov: null

| occupation = Other-service: <=50K

| occupation = Sales

| | race = White: <=50K

| | race = Asian-Pac-Islander: null

| | race = Amer-Indian-Eskimo: null

| | race = Other: null

| | race = Black: >50K

| occupation = Exec-managerial

| | native-country = United-States

| | | education = Bachelors

| | | | sex = Female: <=50K

| | | | sex = Male: >50K

| | | education = Some-college: <=50K

| | | education = 11th: <=50K

| | | education = HS-grad: <=50K

| | | education = Prof-school: null

| | | education = Assoc-acdm: null

| | | education = Assoc-voc: null

| | | education = 9th: null

| | | education = 7th-8th: null

| | | education = 12th: null

| | | education = Masters: <=50K

| | | education = 1st-4th: null

| | | education = 10th: null

| | | education = Doctorate: null

| | | education = 5th-6th: null

| | | education = Preschool: null

| | native-country = England: null

| | native-country = Puerto-Rico: null

| | native-country = Canada: null

| | native-country = Germany: null

| | native-country = India: null

| | native-country = Japan: null

| | native-country = Greece: null

| | native-country = South: null

| | native-country = China: null

| | native-country = Cuba: null

| | native-country = Iran: null

| | native-country = Honduras: null

| | native-country = Philippines: null

| | native-country = Italy: null

| | native-country = Poland: null

| | native-country = Jamaica: null

| | native-country = Vietnam: null

| | native-country = Mexico: null

| | native-country = Portugal: null

| | native-country = Ireland: null

| | native-country = France: >50K

| | native-country = Dominican-Republic: null

| | native-country = Laos: null

| | native-country = Ecuador: null

| | native-country = Taiwan: null

| | native-country = Haiti: null

| | native-country = Columbia: null

| | native-country = Hungary: null

| | native-country = Guatemala: null

| | native-country = Nicaragua: null

| | native-country = Scotland: null

| | native-country = Thailand: null

| | native-country = Yugoslavia: null

| | native-country = El-Salvador: null

| | native-country = Trinadad&Tobago: null

| | native-country = Peru: null

| | native-country = Hong: null

| | native-country = Holand-Netherlands: null

| occupation = Prof-specialty

| | education = Bachelors: <=50K

| | education = Some-college

| | | marital-status = Married-civ-spouse: null

| | | marital-status = Divorced: >50K

| | | marital-status = Never-married: <=50K

| | | marital-status = Separated: null

| | | marital-status = Widowed: null

| | | marital-status = Married-spouse-absent: null

| | education = 11th: null

| | education = HS-grad: null

| | education = Prof-school: >50K

| | education = Assoc-acdm: null

| | education = Assoc-voc: <=50K

| | education = 9th: null

| | education = 7th-8th: null

| | education = 12th: null

| | education = Masters: <=50K

| | education = 1st-4th: null

| | education = 10th: null

| | education = Doctorate: <=50K

| | education = 5th-6th: null

| | education = Preschool: null

| occupation = Handlers-cleaners: <=50K

| occupation = Machine-op-inspct: <=50K

| occupation = Adm-clerical: <=50K

| occupation = Farming-fishing

| | workclass = Private: <=50K

| | workclass = Self-emp-not-inc: <=50K

| | workclass = Self-emp-inc: >50K

| | workclass = Federal-gov: null

| | workclass = Local-gov: null

| | workclass = State-gov: null

| occupation = Transport-moving: <=50K

| occupation = Priv-house-serv: <=50K

| occupation = Protective-serv: <=50K

| occupation = Armed-Forces: null

relationship = Other-relative: <=50K

relationship = Unmarried

| occupation = Tech-support: null

| occupation = Craft-repair: <=50K

| occupation = Other-service: <=50K

| occupation = Sales: <=50K

| occupation = Exec-managerial: <=50K

| occupation = Prof-specialty

| | education = Bachelors

| | | marital-status = Married-civ-spouse: null

| | | marital-status = Divorced: null

| | | marital-status = Never-married: <=50K

| | | marital-status = Separated: null

| | | marital-status = Widowed: >50K

| | | marital-status = Married-spouse-absent: null

| | education = Some-college: null

| | education = 11th: null

| | education = HS-grad: >50K

| | education = Prof-school: null

| | education = Assoc-acdm: null

| | education = Assoc-voc: <=50K

| | education = 9th: null

| | education = 7th-8th: null

| | education = 12th: null

| | education = Masters: null

| | education = 1st-4th: null

| | education = 10th: null

| | education = Doctorate: null

| | education = 5th-6th: null

| | education = Preschool: null

| occupation = Handlers-cleaners: <=50K

| occupation = Machine-op-inspct: <=50K

| occupation = Adm-clerical: <=50K

| occupation = Farming-fishing: <=50K

| occupation = Transport-moving: >50K

| occupation = Priv-house-serv: <=50K

| occupation = Protective-serv: <=50K

| occupation = Armed-Forces: null


Time taken to build model: 0.02 seconds
=== Stratified cross-validation ===

=== Summary ===


Correctly Classified Instances 247 61.75 %

Incorrectly Classified Instances 92 23 %

Kappa statistic 0.2411

Mean absolute error 0.2742

Root mean squared error 0.5033

Relative absolute error 92.3069 %

Root relative squared error 131.0728 %

UnClassified Instances 61 15.25 %

Total Number of Instances 400
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class

0.434 0.186 0.402 0.434 0.418 >50K

0.814 0.566 0.833 0.814 0.823 <=50K
=== Confusion Matrix ===
a b <-- classified as

33 43 | a = >50K

49 214 | b = <=50K
The tree is heavily branched, but many of its leaves are empty (null), meaning that the corresponding combination of attribute values does not occur in the source data. The tree can be read as a set of rules for classifying an object. For example, if relationship = Wife, occupation = Tech-support and workclass = Federal-gov, then income = <=50K, whereas workclass = Local-gov gives income = >50K. Under cross-validation 61.75 % of the instances were classified correctly, 23 % incorrectly, and 15.25 % were left unclassified.
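The summary figures for ID3 can be cross-checked from the confusion matrix. The following is a minimal Python sketch, not Weka code: the function name `summarize` and the dictionary keys are assumptions introduced here, and only the counts (the matrix and the total of 400 instances) are taken from the output above.

```python
def summarize(confusion, total):
    """Return overall shares (in %) and per-class recall/precision.

    confusion[i][j] = number of instances of actual class i predicted as class j.
    total may exceed the matrix sum when some instances were left unclassified.
    """
    classified = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    stats = {
        "accuracy_pct": 100.0 * correct / total,
        "error_pct": 100.0 * (classified - correct) / total,
        "unclassified_pct": 100.0 * (total - classified) / total,
        "accuracy_on_classified_pct": 100.0 * correct / classified,
    }
    per_class = []
    for i in range(len(confusion)):
        actual = sum(confusion[i])                     # row sum
        predicted = sum(row[i] for row in confusion)   # column sum
        per_class.append({
            "recall": confusion[i][i] / actual,
            "precision": confusion[i][i] / predicted,
        })
    return stats, per_class


# ID3 confusion matrix from the output above (classes: >50K, <=50K).
id3_confusion = [[33, 43], [49, 214]]
stats, per_class = summarize(id3_confusion, total=400)
print(stats)
print(per_class)
```

Counted only over the 339 instances that received a prediction, the share of correct answers is about 72.9 %; the overall figure is lower because 61 instances received no prediction at all.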
1R method
This algorithm builds rules from the values of a single independent variable, which is why it is called "1-rule" (1R). For every possible value of each independent variable, a rule is formed that classifies the objects of the training set; the conclusion of the rule is the value of the dependent variable that occurs most often together with that value of the independent variable. The error of a rule is the number of objects that have this value of the independent variable but a different value of the dependent variable. Once the errors have been estimated, the variable whose rule set has the smallest total error is selected. The most serious drawback is oversensitivity (overfitting): the algorithm favours variables that behave like a key, i.e. those with the largest number of distinct values. A true key gives zero error but carries no information. The method is effective when the objects can be classified by a single attribute.
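The procedure described above can be sketched in a few lines of Python. This is an illustration for nominal attributes only, not the Weka implementation; the function name `one_r` and the six-row toy data set are made up for the example and are not part of the lab data.

```python
from collections import Counter, defaultdict


def one_r(rows, target):
    """Return (attribute, rules, errors) for the attribute with the fewest training errors."""
    best = None
    attributes = [name for name in rows[0] if name != target]
    for attr in attributes:
        # Count the target values observed for every value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # Each attribute value predicts its most frequent target value.
        rules = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors = objects whose target differs from the majority value.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best


# Hypothetical toy data, for illustration only.
rows = [
    {"sex": "Male", "relationship": "Husband", "income": ">50K"},
    {"sex": "Male", "relationship": "Husband", "income": ">50K"},
    {"sex": "Male", "relationship": "Own-child", "income": "<=50K"},
    {"sex": "Female", "relationship": "Wife", "income": ">50K"},
    {"sex": "Female", "relationship": "Unmarried", "income": "<=50K"},
    {"sex": "Female", "relationship": "Own-child", "income": "<=50K"},
]
attr, rules, errors = one_r(rows, "income")
print(attr, rules, errors)
```

On this toy data the attribute with more distinct values (relationship, four values) wins with zero errors over sex (two values, two errors), which also illustrates the bias towards key-like attributes. Numeric attributes such as capital-gain would first have to be split into intervals, which this sketch does not do.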
=== Run information ===
Scheme: weka.classifiers.rules.OneR -B 6

Relation: income

Instances: 400

Attributes: 15

age

workclass



fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex


capital-gain

capital-loss

hours-per-week

native-country

income

Test mode: 10-fold cross-validation


=== Classifier model (full training set) ===
capital-gain:

< 4225.0 -> <=50K

>= 4225.0 -> >50K

(331/400 instances correct)

Time taken to build model: 0.02 seconds
=== Stratified cross-validation ===

=== Summary ===


Correctly Classified Instances 331 82.75 %

Incorrectly Classified Instances 69 17.25 %

Kappa statistic 0.3439

Mean absolute error 0.1725

Root mean squared error 0.4153

Relative absolute error 48.9611 %

Root relative squared error 99.0685 %

Total Number of Instances 400


=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class

0.264 0.006 0.923 0.264 0.41 >50K

0.994 0.736 0.821 0.994 0.899 <=50K
=== Confusion Matrix ===
a b <-- classified as

24 67 | a = >50K

2 307 | b = <=50K
For this data set the algorithm selected capital-gain as the variable whose rules give the smallest error. The resulting rule states that a person with capital-gain < 4225.0 has income <=50K, and a person with capital-gain >= 4225.0 has income >50K. Under cross-validation the error was 17.25 %.
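Written out as code, the learned model is a single comparison. The sketch below is illustrative Python; the names `THRESHOLD` and `predict_income` are assumptions introduced here, while the split point and the confusion matrix are copied from the output above.

```python
THRESHOLD = 4225.0  # split point on capital-gain reported by OneR


def predict_income(capital_gain):
    """Apply the single rule learned on capital-gain."""
    return ">50K" if capital_gain >= THRESHOLD else "<=50K"


# Cross-validation confusion matrix from the output above:
# rows = actual class, columns = predicted class (order: >50K, <=50K).
confusion = [[24, 67], [2, 307]]
total = sum(map(sum, confusion))
errors = confusion[0][1] + confusion[1][0]
error_pct = 100.0 * errors / total
missed_high_income = confusion[0][1] / sum(confusion[0])

print(predict_income(0.0), predict_income(5000.0))
print(error_pct, round(missed_high_income, 3))
```

The overall error is low mainly because <=50K is the majority class: roughly 74 % of the instances that actually belong to >50K are still assigned to <=50K by this rule.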
SVM method
The support vector machine (SVM) is a classification algorithm that can also use nonlinear mathematical (kernel) functions. The idea of the method is that the best way to separate points of two classes in an n-dimensional space is an (n-1)-dimensional hyperplane that lies at equal distance from the nearest points of each class and as far from them as possible.
=== Run information ===
Scheme: weka.classifiers.functions.SMO -C 1.0 -E 1.0 -G 0.01 -A 250007 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1

Relation: income

Instances: 400

Attributes: 15

age

workclass



fnlwgt

education

education-num

marital-status

occupation

relationship

race

sex


capital-gain

capital-loss

hours-per-week

native-country

income

Test mode: 10-fold cross-validation


=== Classifier model (full training set) ===
SMO
Classifier for classes: >50K, <=50K
BinarySMO
Machine linear: showing attribute weights, not support vectors.
-0.6491 * (normalized) age

+ 0.1751 * (normalized) workclass=Private

+ 0.3828 * (normalized) workclass=Self-emp-not-inc

+ -0.0568 * (normalized) workclass=Self-emp-inc

+ -0.6314 * (normalized) workclass=Federal-gov

+ 0.5019 * (normalized) workclass=Local-gov

+ -0.3715 * (normalized) workclass=State-gov

+ -0.5522 * (normalized) fnlwgt

+ -0.3053 * (normalized) education=Bachelors

+ 0.0967 * (normalized) education=Some-college

+ 0.0751 * (normalized) education=11th

+ 0.2955 * (normalized) education=HS-grad

+ -1.05 * (normalized) education=Prof-school

+ -0.3735 * (normalized) education=Assoc-acdm

+ -0.1411 * (normalized) education=Assoc-voc

+ 0.2594 * (normalized) education=9th

+ 0 * (normalized) education=7th-8th

+ 0.3087 * (normalized) education=12th

+ 0.1872 * (normalized) education=Masters

+ 0.185 * (normalized) education=10th

+ 0.6148 * (normalized) education=Doctorate

+ -0.1525 * (normalized) education=5th-6th

+ -0.3149 * (normalized) education-num

+ -0.9475 * (normalized) marital-status=Married-civ-spouse

+ -0.2177 * (normalized) marital-status=Divorced

+ 0.5201 * (normalized) marital-status=Never-married

+ 0.1537 * (normalized) marital-status=Separated

+ 0.014 * (normalized) marital-status=Widowed

+ 0.4774 * (normalized) marital-status=Married-spouse-absent

+ 0.1135 * (normalized) occupation=Tech-support

+ -0.1944 * (normalized) occupation=Craft-repair

+ 0.1232 * (normalized) occupation=Other-service

+ -0.0642 * (normalized) occupation=Sales

+ -0.132 * (normalized) occupation=Exec-managerial

+ -0.259 * (normalized) occupation=Prof-specialty

+ 0 * (normalized) occupation=Handlers-cleaners

+ 0.4865 * (normalized) occupation=Machine-op-inspct

+ -0.2118 * (normalized) occupation=Adm-clerical

+ 0.431 * (normalized) occupation=Farming-fishing

+ 0.057 * (normalized) occupation=Transport-moving

+ 0 * (normalized) occupation=Priv-house-serv

+ -0.6407 * (normalized) occupation=Protective-serv

+ 0.2909 * (normalized) occupation=Armed-Forces

+ -1.2382 * (normalized) relationship=Wife

+ -0.2696 * (normalized) relationship=Own-child

+ 0.4911 * (normalized) relationship=Husband

+ 0.3375 * (normalized) relationship=Not-in-family

+ 0.3823 * (normalized) relationship=Other-relative

+ 0.2969 * (normalized) relationship=Unmarried

+ 0.0087 * (normalized) race=White

+ -0.4763 * (normalized) race=Asian-Pac-Islander

+ 0.5512 * (normalized) race=Amer-Indian-Eskimo

+ -0.177 * (normalized) race=Other

+ 0.0935 * (normalized) race=Black

+ -0.382 * (normalized) sex

+ -1.6847 * (normalized) capital-gain

+ -0.9474 * (normalized) capital-loss

+ -1.5535 * (normalized) hours-per-week

+ 0.4497 * (normalized) native-country=United-States

+ -0.0572 * (normalized) native-country=Puerto-Rico

+ 0.63 * (normalized) native-country=Canada

+ -0.7262 * (normalized) native-country=Germany

+ 0.6078 * (normalized) native-country=India

+ 0 * (normalized) native-country=Japan

+ -1 * (normalized) native-country=Cuba

+ 0.675 * (normalized) native-country=Philippines

+ -0.3892 * (normalized) native-country=Italy

+ 0.8802 * (normalized) native-country=Mexico

+ -0.3632 * (normalized) native-country=Ireland

+ -1 * (normalized) native-country=France

+ 0.4555 * (normalized) native-country=Dominican-Republic

+ -0.1624 * (normalized) native-country=Nicaragua

+ 2.0519
Number of kernel evaluations: 51296 (89.919% cached)

Time taken to build model: 0.48 seconds


=== Stratified cross-validation ===

=== Summary ===


Correctly Classified Instances 324 81 %

Incorrectly Classified Instances 76 19 %

Kappa statistic 0.3883

Mean absolute error 0.19

Root mean squared error 0.4359

Relative absolute error 53.9282 %

Root relative squared error 103.9723 %

Total Number of Instances 400


=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure Class

0.418 0.074 0.623 0.418 0.5 >50K

0.926 0.582 0.844 0.926 0.883 <=50K
=== Confusion Matrix ===
a b <-- classified as

38 53 | a = >50K

23 286 | b = <=50K
The output of this algorithm is a vector in n-dimensional space: the numbers listed are the coefficients of the hyperplane that separates the source data into classes. The share of correctly classified instances is fairly high, 81.0 %.
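How such coefficients are used can be shown with a small Python sketch. It is deliberately incomplete: only three of the printed weights and the constant term are used, the inputs are assumed to be already scaled to the range [0, 1], and the mapping of the sign to a class (negative side = >50K) is an assumption that the output above does not state explicitly. The names `WEIGHTS`, `BIAS`, `decision_value` and `classify` are introduced for the example.

```python
# Three of the weights and the constant term printed by SMO above;
# all other terms of the model are omitted for brevity.
WEIGHTS = {"age": -0.6491, "capital-gain": -1.6847, "hours-per-week": -1.5535}
BIAS = 2.0519


def decision_value(x):
    """Signed value of w . x + b for an instance with normalized attributes."""
    return sum(WEIGHTS[name] * x[name] for name in WEIGHTS) + BIAS


def classify(x):
    # Assumption: the negative side of the hyperplane corresponds to >50K.
    return ">50K" if decision_value(x) < 0 else "<=50K"


low = {"age": 0.0, "capital-gain": 0.0, "hours-per-week": 0.0}
high = {"age": 1.0, "capital-gain": 1.0, "hours-per-week": 1.0}
print(decision_value(low), classify(low))
print(decision_value(high), classify(high))
```

The point of the sketch is only the mechanism: the class is decided by which side of the hyperplane the instance falls on, so attributes with large absolute weights (here capital-gain and hours-per-week) influence the decision most.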
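Comparison of the results of the classification algorithms:

Method         Classification accuracy, %    Mean error of the method, %
NAIVE BAYES    83.25                         16.75
ID3            61.75                         23.0
J4.8           82.5                          17.5
1R             82.75                         17.25
SVM (SMO)      81                            19

For ID3 the two figures do not add up to 100 % because 15.25 % of the instances were left unclassified.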


TASK 3: Search the source data for association rules using the Apriori method. If the algorithm cannot be applied directly, use data filters. Vary the rule evaluation metrics. Describe the results obtained.
The task of mining association rules involves finding frequent itemsets among a large number of records. The Apriori algorithm relies on one of the properties of support: the support of any itemset cannot exceed the smallest support among its subsets. Apriori finds the frequent itemsets in several passes. On the i-th pass it determines all frequent i-item sets. Each pass consists of two steps: generating the candidates and counting the support of the candidates.
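The two steps of each pass can be sketched as follows. This is a minimal Python illustration of level-wise frequent itemset mining, not the Weka implementation; the function name `apriori` and the four toy records are assumptions made up for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    def count(candidates):
        # Support counting: keep candidates contained in enough transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    frequent = count({frozenset([item]) for t in transactions for item in t})
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets ...
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # ... and drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Hypothetical toy records, for illustration only.
transactions = [
    {"relationship=Husband", "sex=Male", "race=White"},
    {"relationship=Husband", "sex=Male", "race=White"},
    {"relationship=Husband", "sex=Male", "race=Black"},
    {"relationship=Wife", "sex=Female", "race=White"},
]
frequent_sets = apriori(transactions, min_support=0.5)
for itemset, n in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The pruning step is where the support property is used: a k-item candidate is discarded without counting as soon as one of its (k-1)-item subsets is not frequent.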

The rule search was carried out for three values of minMetric.


minMetric = 0.9
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0

Relation: income-weka.filters.unsupervised.attribute.RemoveType-Tnumeric

Instances: 400

Attributes: 9

workclass

education

marital-status

occupation

relationship

race


sex

native-country

income

=== Associator model (full training set) ===



Apriori

=======
Minimum support: 0.35 (140 instances)

Minimum metric : 0.9

Number of cycles performed: 13


Generated sets of large itemsets:
Size of set of large itemsets L(1): 8
Size of set of large itemsets L(2): 16
Size of set of large itemsets L(3): 12
Size of set of large itemsets L(4): 2
Best rules found:
1. relationship=Husband 155 ==> marital-status=Married-civ-spouse sex=Male 155 conf:(1)

2. marital-status=Married-civ-spouse relationship=Husband 155 ==> sex=Male 155 conf:(1)

3. relationship=Husband sex=Male 155 ==> marital-status=Married-civ-spouse 155 conf:(1)

4. relationship=Husband 155 ==> sex=Male 155 conf:(1)

5. relationship=Husband 155 ==> marital-status=Married-civ-spouse 155 conf:(1)

6. marital-status=Married-civ-spouse sex=Male 158 ==> relationship=Husband 155 conf:(0.98)

7. marital-status=Never-married 148 ==> income=<=50K 144 conf:(0.97)

8. race=White income=<=50K 252 ==> native-country=United-States 235 conf:(0.93)

9. race=White sex=Male 232 ==> native-country=United-States 215 conf:(0.93)

10. race=White 330 ==> native-country=United-States 305 conf:(0.92)



minMetric = 0.6
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.6 -D 0.05 -U 1.0 -M 0.1 -S -1.0

Relation: income-weka.filters.unsupervised.attribute.RemoveType-Tnumeric

Instances: 400

Attributes: 9

workclass

education

marital-status

occupation

relationship

race


sex

native-country

income

=== Associator model (full training set) ===



Apriori

=======
Minimum support: 0.55 (220 instances)

Minimum metric : 0.6

Number of cycles performed: 9


Generated sets of large itemsets:
Size of set of large itemsets L(1): 5
Size of set of large itemsets L(2): 6
Size of set of large itemsets L(3): 1
Best rules found:
1. race=White income=<=50K 252 ==> native-country=United-States 235 conf:(0.93)

2. race=White 330 ==> native-country=United-States 305 conf:(0.92)

3. income=<=50K 309 ==> native-country=United-States 279 conf:(0.9)

4. sex=Male 272 ==> native-country=United-States 242 conf:(0.89)

5. workclass=Private 267 ==> native-country=United-States 234 conf:(0.88)

6. native-country=United-States 355 ==> race=White 305 conf:(0.86)

7. sex=Male 272 ==> race=White 232 conf:(0.85)

8. native-country=United-States income=<=50K 279 ==> race=White 235 conf:(0.84)

9. income=<=50K 309 ==> race=White 252 conf:(0.82)

10. native-country=United-States 355 ==> income=<=50K 279 conf:(0.79)



minMetric = 0.3
=== Run information ===
Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.3 -D 0.05 -U 1.0 -M 0.1 -S -1.0

Relation: income-weka.filters.unsupervised.attribute.RemoveType-Tnumeric

Instances: 400

Attributes: 9

workclass

education

marital-status

occupation

relationship

race


sex

native-country

income

=== Associator model (full training set) ===



Apriori

=======
Minimum support: 0.55 (220 instances)

Minimum metric : 0.3

Number of cycles performed: 9


Generated sets of large itemsets:
Size of set of large itemsets L(1): 5
Size of set of large itemsets L(2): 6
Size of set of large itemsets L(3): 1
Best rules found:
1. race=White income=<=50K 252 ==> native-country=United-States 235 conf:(0.93)

2. race=White 330 ==> native-country=United-States 305 conf:(0.92)

3. income=<=50K 309 ==> native-country=United-States 279 conf:(0.9)

4. sex=Male 272 ==> native-country=United-States 242 conf:(0.89)

5. workclass=Private 267 ==> native-country=United-States 234 conf:(0.88)

6. native-country=United-States 355 ==> race=White 305 conf:(0.86)

7. sex=Male 272 ==> race=White 232 conf:(0.85)

8. native-country=United-States income=<=50K 279 ==> race=White 235 conf:(0.84)

9. income=<=50K 309 ==> race=White 252 conf:(0.82)

10. native-country=United-States 355 ==> income=<=50K 279 conf:(0.79)

In this task the RemoveType filter was applied to remove the attributes of type numeric. In all three cases a set of rules was obtained. Since NumRules was set to 10, each run produced the 10 rules with the highest confidence (conf). The larger the minMetric value, the more reliable the resulting rules. The runs with minMetric = 0.6 and minMetric = 0.3 returned identical rule lists, both at a minimum support of 0.55, whereas the run with minMetric = 0.9 ended at a minimum support of 0.35. For minMetric = 0.9 the following rule, for example, was found: if marital status = married and relationship = husband, then sex = male with certainty (marital-status=Married-civ-spouse relationship=Husband 155 ==> sex=Male 155 conf:(1)). For minMetric = 0.6 the following rule was found: if income = <=50K, then with probability 0.9 the native country is United-States (income=<=50K 309 ==> native-country=United-States 279 conf:(0.9)).
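The conf values in the listings follow directly from the two counts printed with each rule: the number of records matching the left-hand side and the number matching the whole rule. A short Python check, in which the function name `confidence` is an assumption and the counts are copied from the output above:

```python
def confidence(antecedent_count, rule_count):
    """Share of records matching the left-hand side that also match the right-hand side."""
    return rule_count / antecedent_count


# (rule, records matching the left-hand side, records matching the whole rule)
rules = [
    ("marital-status=Married-civ-spouse sex=Male ==> relationship=Husband", 158, 155),
    ("race=White ==> native-country=United-States", 330, 305),
    ("income=<=50K ==> native-country=United-States", 309, 279),
]
confidences = [round(confidence(a, b), 2) for _, a, b in rules]
for (rule, _, _), conf in zip(rules, confidences):
    print(f"{rule}  conf:({conf})")
```

A rule with conf 1, such as relationship=Husband ==> sex=Male, simply means that both counts coincide (155 and 155).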

CONCLUSION


In this lab work, data classification was carried out in practice and association rules were obtained. The classification methods used in the work and the limits of their applicability were studied. The Apriori algorithm for mining association rules was examined in detail.


DATASET


datamining400-46


Moscow 2008

