
Chapter 1: What Is Data Mining?

Data mining is knowledge discovery from data.

Alternative names:
o Knowledge discovery (mining) in databases (KDD)
o Knowledge extraction
o Data/pattern analysis
o Data archeology
o Data dredging
o Information harvesting
o Business intelligence

Why Not Traditional Data Analysis?
o Tremendous amount of data

o High dimensionality of data
o High complexity of data:
  o Data streams and sensor data
  o Time-series data, temporal data, sequence data
  o Structured data
o New and sophisticated applications

Multi-Dimensional View of Data Mining
o Data to be mined
o Knowledge to be mined
o Techniques utilized
o Applications adapted

Data Mining: Classification Schemes
General functionality:
o Descriptive data mining
o Predictive data mining

Data Mining: On What Kinds of Data?
o Database-oriented data sets and applications
o Advanced data sets and advanced applications:
  o Multimedia databases
  o Text databases
  o The World Wide Web

Data Mining Functionalities
o Frequent patterns
o Classification and prediction

o Cluster analysis
o Outlier analysis
o Trend and evolution analysis
o Other pattern-directed or statistical analyses

KDD Process: Several Key Steps
o Learning the application domain
o Creating a target data set: data selection
o Choosing the functions of data mining
o Choosing the mining algorithm
o Using the discovered knowledge

Are All the Discovered Patterns Interesting?
Objective vs. subjective interestingness measures:
o Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.

o Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Primitives That Define a Data Mining Task
o Task-relevant data
o Type of knowledge to be mined
o Background knowledge
o Pattern interestingness measurements
o Visualization

Primitive 3: Background Knowledge
A typical kind of background knowledge is the concept hierarchy:

o Schema hierarchy
o Set-grouping hierarchy
o Operation-derived hierarchy
o Rule-based hierarchy

Primitive 4: Pattern Interestingness Measure
o Simplicity
o Certainty
o Utility
o Novelty

Integration of Data Mining and Data Warehousing
o Coupling of data mining systems, DBMS, and data warehouse systems
o On-line analytical mining of data
o Interactive mining of multi-level knowledge
o Integration of multiple mining functions

Data Mining Applications
o Database analysis and decision support
  o Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  o Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  o Fraud detection and management
o Sports and entertainment: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat
o Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
o Retail and marketing
  o Customer buying patterns and demographic characteristics
  o Mailing campaigns
  o Market basket analysis
  o Trend analysis


Chapter 2: What Are the Reasons for Data Preprocessing?
1) Data is dirty.
2) Data quality is important for the quality of data mining (DM) results.

Why Is Data Dirty?
o Incomplete data
o Noisy data
o Inconsistent data
o Duplicate records also need data cleaning

Multi-Dimensional Measure of Data Quality
o Accuracy

o Completeness
o Consistency
o Timeliness
o Believability

Major Tasks in Data Preprocessing
o Data cleaning
o Data integration
o Data transformation
o Data reduction
o Data discretization

Data Cleaning Tasks
1) Fill in missing values

2) Identify outliers and smooth out noisy data
3) Correct inconsistent data
4) Resolve redundancy caused by data integration

Missing Data
Missing data may be due to:
o Equipment malfunction
o Data that was inconsistent with other recorded data and was therefore deleted
o Data not entered due to misunderstanding

How to Handle Missing Data?
o Ignore the tuple
o Fill in the missing value manually
o Fill it in automatically, e.g., with a constant such as "unknown"
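A minimal sketch of filling in a missing value automatically. The notes only mention the constant "unknown"; using the attribute mean, as done here, is an assumed alternative, and the function name and sample values are hypothetical.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values.

    Assumption: missing values are encoded as None and the attribute
    is numeric. This is one possible automatic fill, not the only one.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical attribute with two missing entries.
ages = [25, None, 35, 30, None]
print(fill_missing_with_mean(ages))
```

The mean is computed only from the observed values, so the missing entries do not influence the value that replaces them.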

Noisy Data
o Noise: random error or variance in a measured variable
o Incorrect attribute values may be due to:
  o Faulty data collection instruments
  o Data entry problems
  o Data transmission problems
  o Technology limitations
  o Inconsistency in naming conventions
o Other data problems that require data cleaning:
  o Duplicate records
  o Incomplete data
  o Inconsistent data

Data Cleaning as a Process
o Data discrepancy detection
  o Data scrubbing: use simple domain knowledge
  o Data auditing: analyze the data to discover rules and relationships, and to detect violators
o Data migration and integration
o Integration of the two processes

Data Integration
o Data integration: combines data from multiple sources into a coherent store
o Schema integration
o Detecting and resolving data value conflicts

Handling Redundancy in Data Integration
Redundant data often occur when multiple databases are integrated:
o Object identification: the same attribute or object may have different names in different databases
o Derivable data: one attribute may be a derived attribute in another table

Data Transformation
o Smoothing: remove noise from data
o Aggregation: summarization, data cube construction
o Generalization: concept hierarchy climbing
o Normalization: scale values to fall within a small, specified range
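One common way to carry out the normalization step is min-max scaling. The notes do not name a specific method, so the sketch below is an assumption, and the function name, parameter names, and sample values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max].

    v' = (v - min) / (max - min) * (new_max - new_min) + new_min
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        # All values are equal; map everything to the lower bound.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

# Hypothetical attribute values.
incomes = [10, 20, 30]
print(min_max_normalize(incomes))
```

The smallest value maps to new_min and the largest to new_max; everything else keeps its relative position in between.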

Discretization
Three types of attributes:
1) Nominal: values from an unordered set, e.g., color, profession
2) Ordinal: values from an ordered set, e.g., military or academic rank
3) Continuous: numeric values, e.g., integer or real numbers

Discretization:
o Divide the range of a continuous attribute into intervals
o Some classification algorithms only accept categorical attributes
o Prepare for further analysis

Discretization and Concept Hierarchy
o Discretization
  o Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  o Interval labels can then be used to replace actual data values
o Concept hierarchy formation

  o Recursively reduce the data by collecting low-level concepts and replacing them with higher-level concepts

Worked Problems: Binning Methods for Data Smoothing (page 23)
Step 1: Sort the numbers in ascending or descending order, as the question requires.
Step 2: Find out how many bins the question asks for. For example, with 12 numbers and 3 bins, 12 / 3 = 4, so each bin holds four numbers.
Step 3: Add up the values in each bin and divide by the number of values. For example, for bin 1: 4, 8, 9, 15, the sum is 4 + 8 + 9 + 15 = 36 and 36 / 4 = 9, where 4 is the number of values. Replace every value in the bin with 9.
Step 4: Rounding. The first value stays as it is. The second value is replaced by whichever bin boundary is closer to it, the one on its left or the one on its right, and the same is done for the third value, and so on.
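The binning steps above can be sketched as follows. Only bin 1 (4, 8, 9, 15) comes from the notes; the other eight values are assumed for illustration, and the function names are hypothetical. Breaking ties toward the lower boundary is also an assumption.

```python
def equal_depth_bins(values, n_bins):
    """Steps 1 and 2: sort, then split into bins of equal size."""
    ordered = sorted(values)
    if len(ordered) % n_bins != 0:
        raise ValueError("number of values must be divisible by n_bins")
    size = len(ordered) // n_bins
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def smooth_by_means(bins):
    """Step 3: replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Step 4: replace every value with the closer bin boundary.

    Each bin is sorted, so its boundaries are the first and last values.
    Ties go to the lower boundary (an assumption).
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

# The first four sorted values are bin 1 from the notes; the rest are assumed.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_depth_bins(data, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

For bin 1 the mean smoothing gives 9 for every value, matching the hand calculation 36 / 4 = 9.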

Explanation of the worked problem on page 29.

Chapter 3: What Is a Data Warehouse?
o A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision making
o Data warehousing: the process of constructing and using data warehouses

Data Warehouse: Subject-Oriented
o Organized around major subjects
o Focuses on the modeling and analysis of data for decision makers
o Provides a simple and concise view around particular subject issues

Data Warehouse: Integrated
o Constructed by integrating multiple data sources
o Data cleaning and data integration techniques are applied

Data Warehouse: Time-Variant
o The time horizon for the data warehouse is significantly longer than that of operational systems
  o Operational database: current-value data
  o Data warehouse data: provide information from a historical perspective
o Every key structure in the data warehouse contains an element of time, explicitly or implicitly
  o The key of operational data may or may not contain a time element

Data Warehouse: Nonvolatile
o A physically separate store of data transformed from the operational environment
o Operational update of data does not occur in the data warehouse environment
o Requires only two operations in data accessing: initial loading of data and access of data

Data Warehouse vs. Heterogeneous DBMS
o Heterogeneous DB integration: query-driven
  o Mediators on top of heterogeneous databases
  o Complex information filtering when a query is posed to a client site
o Data warehouse: update-driven

  o Information from heterogeneous sources is integrated

OLTP vs. OLAP

                    OLTP                                    OLAP
users               clerk, IT professional                  knowledge worker
function            day-to-day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date, detailed,          historical, summarized, multidimensional,
                    flat relational, isolated               integrated, consolidated
usage               repetitive                              ad hoc
access              read/write, index/hash on prim. key     lots of scans
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100MB-GB                                100GB-TB
metric              transaction throughput                  query throughput, response

Why Separate Data Warehouse?
o High performance for both systems
  o DBMS tuned for OLTP: access methods
  o Warehouse tuned for OLAP: complex queries
o Different functions and different data:
  o Missing data: decision support requires historical data, which operational DBs do not typically maintain
  o Data consolidation: decision support (DS) requires consolidation (aggregation, summarization) of data from heterogeneous sources
  o Data quality: different sources typically use inconsistent data representations

Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions and measures
o Star schema: a fact table in the middle connected to a set of dimension tables
o Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
o Fact constellation: multiple fact tables share dimension tables; it is viewed as a collection of stars and is therefore called a galaxy schema

Note: the diagrams on slides 71, 72, and 73 are important.

Measures of Data Cube: Three Categories
o Distributive: the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning

o Algebraic: the measure can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function
o Holistic: there is no constant bound on the storage size needed to describe a subaggregate

Typical OLAP Operations
o Roll-up (drill-up): summarize data
o Drill-down (roll-down): reverse of roll-up
o Slice and dice: project and select
o Pivot
o Drill-across: involves (goes across) more than one fact table
o Drill-through: goes through the bottom level of the cube

Design of Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse:
o Top-down view: the information necessary
o Data source view: exposes the information being captured
o Data warehouse view: fact tables and dimension tables
o Business query view: the view of the end user
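The three categories of data cube measures listed earlier (distributive, algebraic, holistic) can be illustrated with sum, average, and median. The choice of these three functions, the function names, and the sample partitions are assumptions made for this sketch.

```python
from statistics import median

# Hypothetical partitions of one numeric measure.
partitions = [[2, 4, 6], [10], [1, 3]]
all_values = [v for part in partitions for v in part]

def total_from_partitions(parts):
    """Distributive: the sum of the partial sums equals the overall sum."""
    return sum(sum(p) for p in parts)

def average_from_partitions(parts):
    """Algebraic: the average is computed from two distributive
    aggregates, sum and count (so M = 2)."""
    total = sum(sum(p) for p in parts)
    count = sum(len(p) for p in parts)
    return total / count

def median_of_medians(parts):
    """Holistic: combining per-partition medians does not, in general,
    give the true median of all the data."""
    return median(median(p) for p in parts)

print(total_from_partitions(partitions), sum(all_values))
print(average_from_partitions(partitions), sum(all_values) / len(all_values))
print(median_of_medians(partitions), median(all_values))
```

The first two lines print matching pairs, while the last line prints two different numbers, which is why a holistic measure cannot be maintained from a fixed number of per-partition summaries.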

Data Warehouse Design Process
o Top-down, bottom-up approaches, or a combination of both
  o Top-down: starts with overall design and planning (mature)
  o Bottom-up: starts with experiments and prototypes (rapid)
o From a software engineering point of view
  o Waterfall: structured and systematic analysis at each step before proceeding to the next
  o Spiral: rapid generation of increasingly functional systems, with short turnaround time
o Typical data warehouse design process
  o Choose a business process to model, e.g., orders, invoices, etc.
  o Choose the grain
  o Choose the dimensions that will apply to each fact table record
  o Choose the measure that will populate each fact table record

Data Warehouse: A Multi-Tiered Architecture

Three Data Warehouse Models
o Enterprise warehouse: the information about the entire organization
o Data mart: a subset of corporate-wide data
o Virtual warehouse: summary views

Metadata Repository
Metadata is the data defining warehouse objects.

o Operational metadata
o The algorithms used for summarization
o Data related to system performance
o Business data

OLAP Server Architectures
o Relational OLAP (ROLAP)
o Multidimensional OLAP (MOLAP)
o Hybrid OLAP (HOLAP)

From On-Line Analytical Processing (OLAP) to On-Line Analytical Mining (OLAM)
Why online analytical mining?
o High quality of data in data warehouses
o Available information processing structure surrounding data warehouses
o OLAP-based exploratory data analysis
o On-line selection of data mining functions

An OLAM System Architecture

Chapter 4: Why Is Frequent Pattern Mining Important?
o Discloses an intrinsic and important property of data sets
o Classification
o Cluster analysis
o Data warehousing
o Broad applications

Basic Concepts: Frequent Patterns and Association Rules
For a rule X => Y:
o Support, s: probability that a transaction contains X ∪ Y (that is, both X and Y)
o Confidence, c: conditional probability that a transaction having X also contains Y

Scalable Methods for Mining Frequent Patterns
Scalable mining methods: three major approaches
o Apriori
o Frequent pattern growth
o Vertical data format approach

Apriori: A Candidate Generation-and-Test Approach

Method:
o Initially, scan the DB once to get the frequent 1-itemsets
o Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
o Test the candidates against the DB
o Terminate when no frequent or candidate set can be generated

Important Details of the Apriori Algorithm
How to generate candidates?
o Step 1: self-joining
o Step 2: pruning
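A minimal sketch of the support and confidence definitions together with Apriori's two candidate-generation steps. The function names and the four sample transactions are hypothetical. The self-join here takes the union of any two frequent k-itemsets that differ in one item, a simplification of the usual prefix-based join; the pruning step removes the extra candidates this produces.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Conditional probability that a transaction with X also has Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def apriori_gen(frequent_k):
    """Build (k+1)-candidates from a set of frequent k-itemsets."""
    frequent_k = set(frequent_k)
    if not frequent_k:
        return set()
    k = len(next(iter(frequent_k)))
    # Step 1, self-joining: unions of two k-itemsets that have k+1 items.
    joined = {a | b for a in frequent_k for b in frequent_k
              if len(a | b) == k + 1}
    # Step 2, pruning: every k-subset of a candidate must be frequent.
    return {c for c in joined
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

def apriori(transactions, min_support):
    """Return all itemsets whose support is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items
             if support([i], transactions) >= min_support}
    frequent = set()
    while level:  # terminate when no frequent set is left
        frequent |= level
        level = {c for c in apriori_gen(level)
                 if support(c, transactions) >= min_support}
    return frequent

# Hypothetical transaction database.
transactions = [frozenset("ACD"), frozenset("BCE"),
                frozenset("ABCE"), frozenset("BE")]
print(support("BE", transactions))
print(confidence("C", "E", transactions))
print(sorted("".join(sorted(s)) for s in apriori(transactions, 0.5)))
```

This version rescans all transactions for every candidate, which is exactly the support-counting cost that the notes identify as a problem.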

How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
o The total number of candidates can be very large
o One transaction may contain many candidates
Method:
o Candidate itemsets are stored in a hash tree
o A leaf node of the hash tree contains a list of itemsets and counts
o An interior node contains a hash table
o Subset function: finds all the candidates contained in a transaction

Challenges of Frequent Pattern Mining
Challenges:
1) Multiple scans of the transaction database
2) Huge number of candidates
3) Tedious workload of support counting for candidates
Improving Apriori: general ideas
1) Reduce the number of passes over the transaction database
2) Shrink the number of candidates
3) Facilitate support counting of candidates

Mining Multi-Dimensional Association
o Single-dimensional rules
o Multi-dimensional rules
  o Inter-dimension association rules
  o Hybrid-dimension association rules

o Categorical attributes
o Quantitative attributes

Chapter 5: Classification vs. Prediction
o Classification: predicts categorical class labels; classifies data
o Prediction: predicts unknown or missing values

Typical Applications
o Credit approval
o Target marketing

Classification: A Two-Step Process
o Model construction: uses the training data
o Model usage: uses the testing data
o The model built from the training data may be expressed as rules, decision trees, or formulae

Classification Process (1): Model Construction

Supervised vs. Unsupervised Learning

o Supervised learning (classification)
  o Supervision: the training data are accompanied by labels
o Unsupervised learning (clustering)
  o The class labels of the training data are unknown

Issues (2): Evaluating Classification Methods
o Predictive accuracy
o Speed and scalability
o Robustness
o Scalability

Regression Analysis and Log-Linear Models in Prediction
o Linear regression
o Multiple regression
o Log-linear models

Extracting Classification Rules from Trees
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31...40" THEN buys_computer = "yes"

IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"

Classification by Decision Tree Induction
o Decision tree
o Decision tree generation consists of two phases
  o Tree construction: at the start, all the training examples are at the root
  o Tree pruning: remove branches that reflect noise or outliers
o Use of decision tree

Chapter 6: What Is Cluster Analysis?
o Cluster: a collection of data objects
  o Similar to one another within the same cluster

Clustering: Rich Applications and Multidisciplinary Efforts
o Pattern recognition
o Spatial data analysis
o Image processing
o Economic science
o WWW

Quality: What Is Good Clustering?
o High intra-class similarity
o Low inter-class similarity

Measure the Quality of Clustering
o Dissimilarity/similarity metric: similarity is expressed in terms of a distance function
o The definitions of distance functions are usually very different for interval-scaled variables and for other types of variables

Requirements of Clustering in Data Mining
o Scalability
o High dimensionality
o Ability to handle dynamic data
o Discovery of clusters with arbitrary shape

Major Clustering Approaches
o Partitioning approach: construct various partitions and then evaluate them by some criterion
  o Typical methods: k-means
o Hierarchical approach: create a hierarchical decomposition of the set of data
  o Typical methods: ROCK
o Density-based approach: based on connectivity and density functions
  o Typical methods: DBSCAN

Typical Alternatives to Calculate the Distance between Clusters
o Single link
o Complete link
o Average
o Centroid
o Medoid

Heuristic Methods: k-means and k-medoids Algorithms
o k-means: each cluster is represented by the center of the cluster
o k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
1) Partition the objects into k nonempty subsets
2) Compute seed points as the centroids of the clusters of the current partition
3) Assign each object to the cluster with the nearest seed point
4) Repeat steps 2 and 3 until the assignments no longer change

Comments on the K-Means Method
o Strength: relatively efficient
o Comment: often terminates at a local optimum
o Weaknesses
  o Applicable only when the mean is defined
  o Need to specify k, the number of clusters, in advance
  o Unable to handle noisy data and outliers
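The k-means steps can be sketched in plain Python as follows. The round-robin initial partition, the Euclidean distance, the iteration cap, the function name, and the sample points are all assumptions made for illustration.

```python
import math

def k_means(points, k, max_iter=100):
    """Cluster points (tuples of numbers) into k groups."""
    if len(points) < k:
        raise ValueError("need at least k points")
    # Step 1: partition into k nonempty subsets (round-robin, an assumption).
    assignment = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of each current cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
        # Step 3: assign each point to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when no assignment changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centroids, assignment

# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, 2)
print(centroids)
print(labels)
```

The result depends on the initial partition, which is consistent with the comment that k-means often terminates at a local optimum.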

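What Is the Problem of the K-Means Method?
o The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
o K-medoids: instead of taking the mean value of the objects in a cluster as the reference point, a medoid can be used, which is the most centrally located object in the cluster.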
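A small sketch of why the mean is sensitive to outliers while the medoid is not. The one-dimensional cluster, the use of absolute difference as the distance, and the function names are assumptions.

```python
def mean_point(values):
    """Reference point used by k-means: the arithmetic mean."""
    return sum(values) / len(values)

def medoid(values):
    """Reference point used by k-medoids: the actual object whose total
    distance to all other objects in the cluster is smallest."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

# Hypothetical cluster in which 100 is an outlier.
cluster = [1, 2, 3, 4, 100]
print(mean_point(cluster))
print(medoid(cluster))
```

The single outlier pulls the mean far away from the four values that are close together, while the medoid remains one of those values.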