A Novel Algorithm for Automatic Document Clustering

Data mining, Web | Desktop Application
A Novel Algorithm for Automatic Document Clustering The Internet has become an indispensable part of today's life, and the World Wide Web (WWW) is the largest shared information source. Finding relevant information on the WWW is challenging: even with today's search engines, it is difficult to sift through the large number of documents returned for a user query. There is therefore a need to organize large sets of documents into categories through clustering. The document set may come from the results of a user query or simply be an existing collection. Document clustering is the task of grouping a set of documents into clusters so that documents within a cluster are more similar to each other than to documents in other clusters. Partitioning and hierarchical algorithms are commonly used for document clustering. Existing partitioning algorithms have the limitation…
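As an illustration of the partitioning approach mentioned above (not the paper's novel algorithm), the following sketch clusters a toy document collection by representing documents as TF-IDF vectors and running k-means; scikit-learn and the sample documents are assumptions.

    # Hedged sketch: partitioning-based document clustering with TF-IDF + k-means.
    # scikit-learn is an assumed dependency; the documents are illustrative only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "search engines index web pages",
        "web crawlers fetch pages for search engines",
        "clustering groups similar documents together",
        "document clustering organizes a text collection",
    ]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    for doc, label in zip(docs, labels):
        print(label, doc)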

Dynamic Personalized Recommendation on Sparse Data

Data mining, Web | Desktop Application
Dynamic Personalized Recommendation on Sparse Data Recommendation techniques are very important in the fields of E-commerce and other Web-based services. One of the main difficulties is dynamically providing high-quality recommendations on sparse data. In this paper, a novel dynamic personalized recommendation algorithm is proposed in which the information contained in both ratings and profile contents is utilized by exploring latent relations between ratings, a set of dynamic features is designed to describe user preferences in multiple phases, and a recommendation is finally made by adaptively weighting the features. Experimental results on public datasets show that the proposed algorithm achieves satisfactory performance.
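The snippet does not give the algorithm's equations, so the sketch below only illustrates the general idea of adaptively weighting two preference features (a rating-based and a profile-based score) by how much rating data a user has; the function name and the weighting rule are hypothetical.

    # Sketch of adaptively weighting two preference features per user.
    # Not the paper's algorithm; feature names and the weighting rule are illustrative.
    def recommend_score(rating_sim, profile_sim, n_ratings, alpha=5.0):
        """Blend a rating-based and a profile-based score for one candidate item.

        When the user has few ratings (sparse data), lean on profile content;
        as ratings accumulate, shift weight toward the rating-based feature.
        """
        w_rating = n_ratings / (n_ratings + alpha)   # grows from 0 toward 1
        w_profile = 1.0 - w_rating
        return w_rating * rating_sim + w_profile * profile_sim

    # Example: a near-new user relies mostly on profile similarity.
    print(recommend_score(rating_sim=0.8, profile_sim=0.4, n_ratings=1))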

Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases

Data mining, Web | Desktop Application
Efficient Algorithms for Mining High Utility Itemsets from Transactional Databases Mining high utility itemsets from a transactional database refers to the discovery of itemsets with high utility, such as high profit. Although a number of relevant algorithms have been proposed in recent years, they suffer from producing a large number of candidate itemsets, which degrades mining performance in terms of execution time and space requirements. The situation may become worse when the database contains many long transactions or long high utility itemsets. In this paper, we propose two algorithms, namely utility pattern growth (UP-Growth) and UP-Growth+, for mining high utility itemsets with a set of effective strategies for pruning candidate itemsets. The information of high utility itemsets is…
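For readers unfamiliar with the utility measures such algorithms build on, here is a minimal sketch of itemset utility and of transaction-weighted utility (TWU), the standard upper bound used for pruning candidates; the toy transactions and profit table are assumptions, and UP-Growth's tree structures are not shown.

    # Sketch of the basic utility measures used by UP-Growth-style miners.
    # Transactions map item -> purchased quantity; profit maps item -> unit profit.
    transactions = [
        {"a": 2, "b": 1},
        {"a": 1, "c": 3},
        {"b": 2, "c": 1},
    ]
    profit = {"a": 5, "b": 3, "c": 1}

    def utility(itemset, tx):
        """Utility of an itemset in one transaction (0 if not fully contained)."""
        if not set(itemset).issubset(tx):
            return 0
        return sum(tx[i] * profit[i] for i in itemset)

    def transaction_weighted_utility(itemset):
        """TWU: sum of full-transaction utilities over transactions containing the
        itemset. An upper bound used to prune candidates below the threshold."""
        return sum(
            sum(tx[i] * profit[i] for i in tx)
            for tx in transactions
            if set(itemset).issubset(tx)
        )

    print(utility(("a", "b"), transactions[0]))    # 2*5 + 1*3 = 13
    print(transaction_weighted_utility(("a",)))    # 13 + 8 = 21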

Sensitive Label Privacy Protection on Social Network Data

Data mining, Web | Desktop Application
Sensitive Label Privacy Protection on Social Network Data This paper is motivated by the recognition of the need for finer-grained and more personalized privacy in data publication of social networks. We propose a privacy protection scheme that prevents not only the disclosure of users' identities but also the disclosure of selected features in users' profiles. An individual user can select which features of her profile she wishes to conceal. The social networks are modeled as graphs in which users are nodes and features are labels. Labels are denoted either as sensitive or as non-sensitive. We treat node labels both as background knowledge an adversary may possess and as sensitive information that has to be protected. We present privacy protection algorithms that allow for graph data to be…
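A minimal data-model sketch of the setting described above, assuming toy users and labels; it only shows the graph-with-labels representation and a naive redaction step, not the paper's protection algorithms.

    # Minimal data-model sketch (not the paper's algorithm): a social network as a
    # labeled graph where each user marks which of her own labels are sensitive.
    edges = [("alice", "bob"), ("bob", "carol")]
    labels = {
        "alice": {"teacher", "hiv-positive"},
        "bob": {"lawyer"},
        "carol": {"student", "in-debt"},
    }
    sensitive = {                      # chosen by each user, per the scheme's premise
        "alice": {"hiv-positive"},
        "bob": set(),
        "carol": {"in-debt"},
    }

    def publishable_view():
        """Strip user identities and sensitive labels; real protection additionally
        perturbs the graph so labels cannot be re-linked to nodes."""
        anon = {user: "v{}".format(i) for i, user in enumerate(labels)}
        return (
            [(anon[u], anon[v]) for u, v in edges],
            {anon[u]: labels[u] - sensitive[u] for u in labels},
        )

    print(publishable_view())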

Privacy against Aggregate Knowledge Attacks

Data mining, Web | Desktop Application
Privacy against Aggregate Knowledge Attacks This paper focuses on protecting the privacy of individuals in publication scenarios where the attacker is expected to have only abstract or aggregate knowledge about each record. Whereas data privacy research usually focuses on defining stricter privacy guarantees that assume increasingly sophisticated attack scenarios, it is also important to have anonymization methods and guarantees that match the attack scenario at hand, since enforcing a stricter guarantee than required unnecessarily increases the information loss. Consider, for example, the publication of tax records, where attackers might only know the total income and not its constituent parts. Traditional anonymization methods would protect user privacy by creating equivalence classes of identical records. Alternatively, in this work we propose an anonymization technique that generalizes attributes only as much…
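As a hedged illustration of generalizing an aggregate attribute "only as much as needed", the sketch below groups published income values into ranges shared by at least k records; the grouping rule is an assumption, not the paper's technique or guarantee.

    # Sketch: publish each record's total income as a range covering at least
    # k records, so an attacker with aggregate knowledge cannot single anyone out.
    def generalize(incomes, k=3):
        """Group sorted incomes into ranges of >= k records; return each record's range."""
        order = sorted(range(len(incomes)), key=lambda i: incomes[i])
        groups = [order[i:i + k] for i in range(0, len(order), k)]
        if len(groups) > 1 and len(groups[-1]) < k:   # fold a short tail into its neighbour
            groups[-2].extend(groups.pop())
        published = [None] * len(incomes)
        for g in groups:
            low, high = incomes[g[0]], incomes[g[-1]]
            for i in g:
                published[i] = (low, high)
        return published

    print(generalize([52, 48, 120, 95, 51, 130, 49], k=3))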

Adapting a Ranking Model for Domain-Specific Search

Data mining, Web | Desktop Application
Adapting a Ranking Model for Domain-Specific Search An adaptation process is described for adapting a ranking model constructed for a broad-based search engine into a domain-specific ranking model. Applying the broad-based ranking model directly to different domains is difficult due to domain differences, while building a unique ranking model for each domain is time-consuming because of the cost of training. In this paper, we address these difficulties by proposing an algorithm called ranking adaptation SVM (RA-SVM). Our algorithm only requires the predictions of the existing ranking models, rather than their internal representations or the data from auxiliary domains. The ranking model is adapted for use in a search environment focusing on a specific segment of online content, for example, a specific topic, media type, or genre of content. A domain-specific ranking model…
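The snippet does not include the RA-SVM formulation, so the following sketch only shows the broader idea it alludes to: reuse the broad-based ranker purely through its predictions by feeding them, together with domain features, into a small pairwise ranking model. The function names, the RankSVM-style pair construction, and the scikit-learn stand-in are assumptions, not RA-SVM itself.

    # Sketch of one simple adaptation idea consistent with the abstract: reuse the
    # broad-based ranker only through its predictions, adding them as a feature for
    # a small domain-specific model. This is NOT the RA-SVM formulation itself.
    import numpy as np
    from sklearn.svm import LinearSVC   # assumed stand-in for a pairwise ranking SVM

    def adapt(base_scores, domain_features, pairwise_labels):
        """base_scores: (n,) predictions of the existing broad-based ranker.
        domain_features: (n, d) features available only in the target domain.
        pairwise_labels: list of (i, j) pairs meaning doc i should rank above doc j."""
        X = np.hstack([base_scores.reshape(-1, 1), domain_features])
        # Turn pairwise preferences into a binary classification problem (RankSVM-style).
        diffs = np.array([X[i] - X[j] for i, j in pairwise_labels])
        y = np.ones(len(pairwise_labels))
        diffs = np.vstack([diffs, -diffs])          # mirrored pairs give both classes
        y = np.concatenate([y, -y])
        model = LinearSVC(fit_intercept=False).fit(diffs, y)
        return lambda scores, feats: np.hstack([scores.reshape(-1, 1), feats]) @ model.coef_.ravel()

    # Tiny demo with random data (shapes only; the rankings are not meaningful).
    rng = np.random.default_rng(0)
    base, feats = rng.random(6), rng.random((6, 3))
    rank = adapt(base, feats, [(0, 1), (2, 3), (4, 5)])
    print(rank(base, feats))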

Efficient Similarity Search over Encrypted Data

Data mining, Web | Desktop Application
Efficient Similarity Search over Encrypted Data In recent years, a vast amount of data has been stored in the cloud. Although cloud-based services offer many advantages, the privacy and security of sensitive data is a big concern. To mitigate these concerns, it is desirable to outsource sensitive data in encrypted form. Encrypted storage protects the data against illegal access, but it complicates some basic yet important functionality such as search over the data. To achieve search over encrypted data without compromising privacy, a considerable number of searchable encryption schemes have been proposed in the literature. However, almost all of them handle exact query matching but not similarity matching, a crucial requirement for real-world applications. Although some sophisticated secure multi-party computation based cryptographic techniques are available for similarity tests, they are computationally intensive…
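The paper's construction is not given in the snippet; as a hedged sketch of one common building block for similarity search over encrypted indexes, the code below buckets documents with MinHash and hides the bucket identifiers behind a keyed HMAC so the server can match them only by equality. Key handling and parameters are illustrative assumptions, not a secure scheme.

    # Sketch (not the paper's construction): index documents for similarity search by
    # MinHash bucketing, then hide bucket identifiers from the server with a keyed HMAC.
    import hashlib, hmac

    KEY = b"client-secret-key"          # assumed shared only among authorized clients
    NUM_HASHES = 8

    def minhash(tokens):
        """One MinHash value per seed over the document's token set."""
        sig = []
        for seed in range(NUM_HASHES):
            sig.append(min(
                hashlib.sha256("{}:{}".format(seed, t).encode()).hexdigest() for t in tokens
            ))
        return sig

    def trapdoors(tokens):
        """Keyed bucket tags the server can compare for equality without learning tokens."""
        return {hmac.new(KEY, "{}:{}".format(i, v).encode(), hashlib.sha256).hexdigest()
                for i, v in enumerate(minhash(tokens))}

    # Similar documents share many bucket tags; the count approximates Jaccard similarity.
    a = trapdoors({"cloud", "storage", "privacy", "search"})
    b = trapdoors({"cloud", "storage", "privacy", "query"})
    print(len(a & b), "of", NUM_HASHES, "buckets match")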

A Bayesian Approach to Filtering Junk E-Mail

Data mining, Web | Desktop Application
A Bayesian Approach to Filtering Junk E-Mail In addressing the growing problem of junk E-mail on the Internet, we examine methods for the automated construction of filters to eliminate such unwanted messages from a user's mail stream. By casting this problem in a decision theoretic framework, we are able to make use of probabilistic learning methods in conjunction with a notion of differential misclassification cost to produce filters which are especially appropriate for the nuances of this task. While this may appear, at first, to be a straightforward text classification problem, we show that by considering domain-specific features of this problem in addition to the raw text of E-mail messages, we can produce much more accurate filters. Finally, we show the efficacy of such…
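A minimal sketch of the Bayesian filtering idea using word features only (the paper also considers domain-specific, non-textual features): score a message by its posterior probability of being junk and filter only above a high threshold, reflecting the asymmetric misclassification cost. The tiny training corpus and the threshold value are assumptions.

    # Toy naive-Bayes junk filter: classify as junk only when the posterior is high,
    # because misclassifying legitimate mail is costlier than letting junk through.
    import math
    from collections import Counter

    junk = ["win money now", "cheap money offer", "win a free offer"]
    legit = ["meeting schedule for monday", "project status and schedule", "lunch on monday"]

    def train(messages):
        counts = Counter(w for m in messages for w in m.split())
        return counts, sum(counts.values())

    junk_counts, junk_total = train(junk)
    legit_counts, legit_total = train(legit)
    vocab = set(junk_counts) | set(legit_counts)

    def log_prob(msg, counts, total):
        """Log-likelihood of the message's words with Laplace smoothing."""
        return sum(math.log((counts[w] + 1) / (total + len(vocab))) for w in msg.split())

    def is_junk(msg, threshold=0.9):
        """Require high posterior confidence before filtering (cost-sensitive)."""
        lj = log_prob(msg, junk_counts, junk_total) + math.log(0.5)    # equal priors assumed
        ll = log_prob(msg, legit_counts, legit_total) + math.log(0.5)
        posterior = math.exp(lj) / (math.exp(lj) + math.exp(ll))
        return posterior > threshold

    print(is_junk("win free money"))           # True on this toy data
    print(is_junk("monday meeting schedule"))  # False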

Opinion Mining for Web Search

Data mining, Web | Desktop Application
Opinion Mining for Web Search Generally, a search engine retrieves information using PageRank, distance-vector algorithms, crawling, etc., on the basis of the user's query. However, the links retrieved by a search engine may or may not be exactly related to the user's query, and the user has to check every link to find out whether the needed information is present in the document; this becomes a tedious and time-consuming job. Our focus is to cluster different documents based on subjective similarities and dissimilarities. Our proposed tool, 'Web Search Miner', is based on the concept of mining user opinions; it uses the k-means algorithm and a distance measure based on term frequency and web document frequency for mining the search…
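A hedged sketch of the kind of distance measure described above: weight terms by term frequency scaled against their document frequency in the retrieved set, then compare documents by cosine distance. The exact weighting used by 'Web Search Miner' is not given in the snippet, so the formula here is an assumption.

    # Sketch of a TF x inverse-document-frequency weighting over a retrieved set,
    # compared with cosine distance; weighting details are illustrative assumptions.
    import math
    from collections import Counter

    def weights(doc, corpus):
        tf = Counter(doc.lower().split())
        n = len(corpus)
        w = {}
        for term, f in tf.items():
            df = sum(term in d.lower().split() for d in corpus)
            w[term] = f * math.log((n + 1) / (df + 1))
        return w

    def distance(a, b, corpus):
        wa, wb = weights(a, corpus), weights(b, corpus)
        dot = sum(wa[t] * wb.get(t, 0.0) for t in wa)
        na = math.sqrt(sum(v * v for v in wa.values()))
        nb = math.sqrt(sum(v * v for v in wb.values()))
        return 1.0 - dot / (na * nb) if na and nb else 1.0

    corpus = [
        "great phone with a great camera",
        "camera quality is great",
        "battery drains fast and is poor",
    ]
    print(distance(corpus[0], corpus[1], corpus))   # smaller: related opinions
    print(distance(corpus[0], corpus[2], corpus))   # larger: unrelated opinions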

Distributed Association rule mining : Market basket Analysis

Data mining
Distributed Association rule mining : Market basket Analysis Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Web Usage Mining using Apriori

Data mining, Web | Desktop Application
Web Usage Mining using Apriori The enormous amount of information on the World Wide Web makes it an obvious candidate for data mining research. The application of data mining techniques to the World Wide Web is referred to as Web mining, a term that has been used in three distinct ways: Web Content Mining, Web Structure Mining and Web Usage Mining. E-Learning is one of the Web-based applications that faces large amounts of data. In order to produce usage patterns and user behaviours for an E-Learning portal, this paper implements the high-level process of Web Usage Mining using an advanced association rule algorithm called the D-Apriori Algorithm. Web Usage Mining consists of three main phases, namely Data Preprocessing, Pattern Discovery and Pattern Analysis. Server log files become a set of raw…
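Since D-Apriori's distributed refinements are not in the snippet, the sketch below shows only the plain Apriori iteration that the pattern discovery phase relies on, applied to toy sessions extracted from a server log; the sessions and support threshold are assumptions.

    # Plain Apriori over toy click sessions (transactions of visited pages).
    sessions = [
        {"home", "courses", "quiz"},
        {"home", "courses", "forum"},
        {"home", "quiz"},
        {"courses", "quiz", "forum"},
    ]
    MIN_SUPPORT = 2   # absolute support threshold, an assumption for this toy log

    def apriori(transactions, min_support):
        items = {i for t in transactions for i in t}
        frequent, k, current = {}, 1, [frozenset([i]) for i in sorted(items)]
        while current:
            counts = {c: sum(c <= t for t in transactions) for c in current}
            level = {c: n for c, n in counts.items() if n >= min_support}
            frequent.update(level)
            k += 1
            # candidate generation: join frequent (k-1)-itemsets, keep size-k unions
            current = list({a | b for a in level for b in level if len(a | b) == k})
        return frequent

    for itemset, support in sorted(apriori(sessions, MIN_SUPPORT).items(), key=lambda x: -x[1]):
        print(set(itemset), support)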

Sales & Inventory Prediction using Data Mining

Data mining, Web | Desktop Application
Sales & Inventory Prediction using Data Mining Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.

Hiding Sensitive Association Rule for Privacy Preservation

Data mining, Web | Desktop Application
Hiding Sensitive Association Rule for Privacy Preservation Data mining techniques have been widely used in various applications. However, the misuse of these techniques may lead to the disclosure of sensitive information. Researchers have recently made efforts at hiding sensitive association rules. Nevertheless, undesired side effects, e.g., non sensitive rules falsely hidden and spurious rules falsely generated, may be produced in the rule hiding process. In this paper, we present a novel approach that strategically modifies a few transactions in the transaction database to decrease the supports or confidences of sensitive rules without producing the side effects. Since the correlation among rules can make it impossible to achieve this goal, in this paper, we propose heuristic methods for increasing the number of hidden sensitive rules and reducing the number of modified…
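To make the hiding operation concrete, here is a minimal sketch of the basic move such approaches build on: delete the consequent of a sensitive rule from a few supporting transactions until its confidence drops below the mining threshold. The transactions and threshold are toy assumptions, and the paper's strategies for controlling side effects are not reproduced.

    # Sketch of the basic rule-hiding move: lower a sensitive rule's confidence
    # below the mining threshold by removing its consequent from a few transactions.
    transactions = [
        {"bread", "milk", "butter"},
        {"bread", "milk"},
        {"bread", "milk", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
    ]

    def confidence(antecedent, consequent, db):
        both = sum(antecedent | consequent <= t for t in db)
        ante = sum(antecedent <= t for t in db)
        return both / ante if ante else 0.0

    def hide_rule(antecedent, consequent, db, min_conf=0.6):
        """Remove the consequent from supporting transactions until the rule's
        confidence falls below min_conf (so standard miners no longer report it)."""
        for t in db:
            if confidence(antecedent, consequent, db) < min_conf:
                break
            if antecedent | consequent <= t:
                t -= consequent          # modify this transaction in place
        return db

    rule_from, rule_to = {"bread", "milk"}, {"butter"}
    print("before:", confidence(rule_from, rule_to, transactions))
    hide_rule(rule_from, rule_to, transactions)
    print("after: ", confidence(rule_from, rule_to, transactions))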

Efficiency of content distribution via network coding

Networking
Efficiency of content distribution via network coding Content distribution via network coding has received a lot of attention lately. However, direct application of network coding may be insecure. In particular, attackers can inject “bogus” data to corrupt the content distribution process so as to hinder the information dispersal or even deplete network resources. Therefore, content verification is an important and practical issue when network coding is employed. When random linear network coding is used, it is infeasible for the source of the content to sign all the data, and hence, the traditional “hash-and-sign” methods are no longer applicable. Recently, a new on-the-fly verification technique has been proposed by Krohn et al. (IEEE S&P ’04), which employs a classical homomorphic hash function. However, this technique is difficult to apply…
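As a hedged illustration of the homomorphic-hash idea behind such on-the-fly verification (toy parameters, not secure, and not the exact scheme analysed in the paper), the check below verifies a random linear combination of two blocks against only the hashes of the original blocks.

    # Homomorphic-hash sketch: the hash of a linear combination of blocks equals the
    # same combination of the blocks' hashes in the exponent. Toy-sized parameters.
    p = 1019                      # small prime for illustration only
    g = [2, 3, 5, 7]              # per-position generators (assumed public parameters)

    def h(block):
        """Hash a block (a vector of integers) as the product of g_i^{v_i} mod p."""
        out = 1
        for gi, vi in zip(g, block):
            out = out * pow(gi, vi, p) % p
        return out

    x = [4, 1, 7, 2]
    y = [3, 9, 0, 5]
    a, b = 6, 11                  # random coding coefficients chosen by a relay node

    combined = [a * xi + b * yi for xi, yi in zip(x, y)]     # the coded block
    # A receiver who only knows h(x) and h(y) can verify the coded block:
    assert h(combined) == pow(h(x), a, p) * pow(h(y), b, p) % p
    print("coded block verified against source hashes")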

Effective Pattern Discovery for Text Mining

Data mining, Web | Desktop Application
Effective Pattern Discovery for Text Mining Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information.
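The snippet does not spell out pattern deploying or evolving, so the sketch below only illustrates one plausible reading of the pattern-based approach: mine frequent term-sets ("patterns") from a document's paragraphs and deploy them by accumulating each pattern's support onto its terms. The procedure and thresholds are assumptions, not the paper's method.

    # Hedged sketch: frequent term-set "patterns" per paragraph, deployed into term weights.
    from itertools import combinations
    from collections import defaultdict

    paragraphs = [
        {"global", "economy", "growth"},
        {"global", "economy", "policy"},
        {"growth", "policy"},
        {"global", "growth"},
    ]
    MIN_SUP = 2   # a pattern must appear in at least this many paragraphs (assumption)

    def frequent_patterns(parts, min_sup, max_len=2):
        patterns = {}
        terms = sorted({t for p in parts for t in p})
        for k in range(1, max_len + 1):
            for combo in combinations(terms, k):
                sup = sum(set(combo) <= p for p in parts)
                if sup >= min_sup:
                    patterns[combo] = sup
        return patterns

    def deploy(patterns):
        """Term weight = sum of supports of the patterns containing the term."""
        weights = defaultdict(int)
        for pattern, sup in patterns.items():
            for term in pattern:
                weights[term] += sup
        return dict(weights)

    print(deploy(frequent_patterns(paragraphs, MIN_SUP)))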

Data leakage Detection

Security and Encryption, Web | Desktop Application
Data leakage Detection While doing business, sometimes sensitive data must be handed over to supposedly trusted third parties. For example, a hospital may give patient records to researchers who will devise new treatments. Similarly, a company may have partnerships with other companies that require sharing customer data. Another enterprise may outsource its data processing, so data must be given to various other companies. We call the owner of the data the distributor and the supposedly trusted third parties the agents. Our goal is to detect when the distributor’s sensitive data has been leaked by agents, and if possible to identify the agent that leaked the data. We consider applications where the original sensitive data cannot be perturbed. Perturbation is a very useful technique where the data is modified and made…

Medical Disease Diagnosis using Data Mining

Data mining, Web | Desktop Application
Medical Disease Diagnosis using Data Mining The healthcare industry collects a huge amount of data that is not properly mined and not put to optimum use; the hidden patterns and relationships in it often go unexploited. Our research focuses on this aspect of medical diagnosis by learning patterns from collected data on diabetes, hepatitis and heart disease, and on developing intelligent medical decision support systems to help physicians. In this paper, we propose the use of the decision tree algorithms C4.5, ID3 and CART to classify these diseases and compare their effectiveness and correct classification rates.
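A brief hedged sketch of the kind of comparison described above: scikit-learn's DecisionTreeClassifier with the 'gini' criterion approximates CART's splitting rule, while 'entropy' approximates the information-gain criterion of ID3/C4.5. The synthetic data stands in for real diabetes, hepatitis or heart-disease records, which are not available here.

    # Compare decision-tree split criteria by cross-validated accuracy on placeholder data.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((200, 4))                          # e.g. glucose, bmi, age, bp (scaled)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0.9).astype(int)   # synthetic "disease" label

    for criterion in ("gini", "entropy"):
        tree = DecisionTreeClassifier(criterion=criterion, max_depth=4, random_state=0)
        score = cross_val_score(tree, X, y, cv=5).mean()
        print(criterion, round(score, 3))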