Project List

Vector-based Sentiment Analysis of Movie Reviews

Artificial Intelligence & ML, Data mining, Machine Learning
We investigate sentence sentiment using the Pang and Lee dataset as annotated by Socher et al. [1]. Sentiment analysis research focuses on understanding the positive or negative tone of a sentence based on its syntax, structure, and content. Previous research used a tree-based model to label sentence sentiment on a 5-point scale. Our project takes a different approach: abstracting the sentence as a vector and applying vector classification schemes. We explore two components. First, we analyze different sentence representations, such as bag of words, word sentiment location, and negation, and abstract them into a set of features. Second, we classify sentence sentiment using this set of features and compare the effectiveness of…
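
The idea of abstracting a sentence as a feature vector can be illustrated with a minimal sketch. It assumes a tiny hand-picked vocabulary and a hypothetical list of negation cues; neither comes from the project, and the actual feature set is not given in this excerpt.

```python
from collections import Counter

# Hypothetical negation cue list, chosen only for illustration.
NEGATION_WORDS = {"not", "no", "never"}


def sentence_to_features(sentence, vocabulary):
    """Map a sentence to a fixed-length vector: word counts plus a negation flag."""
    tokens = sentence.lower().split()
    counts = Counter(tokens)
    # Bag-of-words part: one count per vocabulary word, in vocabulary order.
    vector = [counts[word] for word in vocabulary]
    # Final feature: 1 if any negation cue appears in the sentence, else 0.
    vector.append(int(any(tok in NEGATION_WORDS for tok in tokens)))
    return vector


# Illustrative vocabulary (an assumption, not the project's).
vocabulary = ["good", "bad", "movie"]
features = sentence_to_features("This movie is not good", vocabulary)
```

A vector like this could then be passed to any standard vector classifier; the choice of classifier is left open here.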

Using Tweets for single stock price prediction

Artificial Intelligence & ML, Data mining, Machine Learning
Social media, as the collective form of individual opinions and emotions, has a profound though perhaps subtle relationship with social events. This is particularly true of public tweets and stock trading. Research has shown that people’s financial decisions are significantly driven by emotions [1]. These emotions, together with people’s opinions, are reflected in tweets in real time. As a result, by analyzing relevant tweets with suitable machine learning algorithms, one could grasp the public’s sentiment and attitude towards a stock of interest, which could intuitively predict its next move. Previous work has shown that tweets can indeed reflect stock price changes. Bollen et al. (2010) randomly selected…

Recommendation based on user experiences

Artificial Intelligence & ML, Data mining, Machine Learning
Recommender systems follow two main strategies: content-based filtering and collaborative filtering. Collaborative filtering is often the preferred approach, as it requires no domain knowledge and no feature-gathering effort. The two primary methods for collaborative filtering are latent factor models and neighborhood methods. In user-user neighborhood methods, similarity between users is measured by transforming them into the item space. Similar logic applies to item-item similarity. In latent factor methods, both users and items are transformed into a latent feature space. An item is recommended to a user if they are similar, that is, if the similarity of their vector representations in the latent feature space is relatively high. We select the latent factor model because it allows us to identify the hidden features of the users. These features are time independent. We first…
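
The latent factor idea can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional vectors are made up, the dot product is used as the similarity measure, and the function names `dot` and `recommend` are hypothetical. How the project learns its latent vectors is not shown in this excerpt.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vector, item_vectors, top_n=1):
    """Return the names of the top_n items whose latent vectors score highest for the user."""
    ranked = sorted(
        item_vectors.items(),
        key=lambda pair: dot(user_vector, pair[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]


# Made-up latent vectors, for illustration only.
user = [0.9, 0.1]
items = {
    "item_a": [1.0, 0.0],
    "item_b": [0.0, 1.0],
    "item_c": [0.5, 0.5],
}
top = recommend(user, items, top_n=2)
```

Other similarity measures, such as cosine similarity, would fit the same structure by swapping out `dot`.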

Learning To Predict Dental Caries For Preschool Children

Artificial Intelligence & ML, Data mining, Machine Learning
Dental caries, also known as tooth decay or cavities, is a dental disease caused by bacterial infection. Among the different age groups, preschool children require more attention, since caries has become the most common chronic childhood disease. More importantly, a skewed distribution of the disease has been observed among children and preschoolers in Europe, the US, and Singapore, which indicates that a small portion of the population bears a large share of caries incidence. Therefore, there is still a need to improve current caries control in order to identify high-risk individuals and prevent a resurgence in children in developed countries like Singapore. Our project will study data such as questionnaire responses, oral examinations, and biological tests of certain preschoolers from Singapore and use suitable…

Predicting air pollution level in a specific city

Artificial Intelligence & ML, Data mining, Machine Learning
The regulation of air pollutant levels is rapidly becoming one of the most important tasks for the governments of developing countries, especially China. Among the pollutant indices, fine particulate matter (PM2.5) is a significant one because it poses a serious concern for people's health when its level in the air is relatively high. PM2.5 refers to tiny particles in the air that reduce visibility and cause the air to appear hazy when levels are elevated. However, the relationships between the concentration of these particles and meteorological and traffic factors are poorly understood. To shed some light on these connections, some advanced techniques have been introduced into air quality research. These studies utilized selected techniques, such as Support Vector Machine (SVM)…

Sentiment Analysis on Movie Reviews

Artificial Intelligence & ML, Data mining, Machine Learning
Sentiment analysis is a well-known task in the realm of natural language processing. Given a set of texts, the objective is to determine the polarity of each text. [9] provides a comprehensive survey of various methods, benchmarks, and resources for sentiment analysis and opinion mining. The sentiments can consist of different classes. In this study, we consider two cases: 1) A movie review is positive (+) or negative (-). This is similar to [2], where the authors also employ a novel similarity measure. In [10], the authors perform sentiment analysis after summarizing the text. 2) A movie review is very negative (- -), somewhat negative (-), neutral (o), somewhat positive (+), or very positive (+ +). For the first case, we picked a Kaggle competition called “Bag…

Predicting Soccer Results in the English Premier League

Artificial Intelligence & ML, Data mining, Machine Learning
There were many displays of genius during the 2010 World Cup, ranging from Andrés Iniesta to Thomas Müller, but none were as unusual as that of Paul the Octopus. This sea dweller correctly chose the winner of a match all eight times that he was tested. This accuracy contrasts sharply with the World Cup predictions of one of our team members, who was correct only about half the time. Partly out of love for the game, and partly from the shame of being outdone by an octopus, we have decided to attempt to predict the outcomes of soccer matches. This has real-world applications for gambling, coaching improvements, and journalism. Out of the many leagues we could have chosen, we decided upon the…

Classifying Online User Behavior Using Contextual Data

Artificial Intelligence & ML, Data mining, Machine Learning
Despite the great computational power of machines, there are some things, like interest-based segregation, that only humans can instinctively distinguish. For example, a human can easily tell whether a tweet is about a book or about a kitchen utensil. However, to write a rule-based computer program to solve this task, a programmer must lay down very precise criteria for these classifications. There has been a massive increase in the amount of structured user-generated content on the Internet in the form of tweets, reviews on Amazon and eBay, etc. As opposed to stand-alone companies, which leverage their own hubs of data to run behavioral analytics, we strive to gain insights into online user behavior and interests based on free and public data. By…

Extracting Word Relationships from Unstructured Data

Artificial Intelligence & ML, Machine Learning
Robots are advancing rapidly in their behavioural functionality, allowing them to perform sophisticated tasks. However, their ability to take natural language instructions is still in its infancy. Parsing, semantic interpretation, and dialogue management are typically performed only on a limited set of primitives, thus limiting the set of instructions that could be given to a robot. This limits a robot’s applicability in unconstrained natural environments (like households and offices) [8]. In this project, we address only the problem of semantic interpretation of human instructions. Specifically, our Extracto algorithm provides a method to extract potential actions (verbs) that could be performed given two household objects (nouns). For example, given the nouns “Coffee” and “Cup”, Extracto identifies the action (verb) “pour”, indicating that ‘coffee should…

Bird Species Identification from an Image

Artificial Intelligence & ML, Image Processing, Machine Learning
In daily life we can hear a variety of sounds from living creatures, including human speech, dog barks, birdsong, frog calls, etc. Many animals generate sounds either for communication or as a by-product of their living activities, such as eating, moving, flying, and mating. Bird species identification is a well-known problem to ornithologists, and it has been considered a scientific task since antiquity. Birds and their sounds are in many ways important for our culture. They can be heard even in big cities, and most people can recognize at least a few of the most common species by their sounds. Biologists have tried to investigate species richness, presence or absence of indicator species, and the population sizes of…

Predicting ground shaking intensities using DYFI data and estimating event terms to identify induced earthquakes

Artificial Intelligence & ML, Machine Learning
There has been a dramatic increase in seismicity in the Central and Eastern United States (CEUS) in recent years (Ellsworth 2013). This increased seismicity may be caused by anthropogenic processes, in which case it is referred to as induced or triggered seismicity. The earthquakes are a nuisance for people, and some larger-magnitude earthquakes have also caused structural damage. Hence, it is important to quantify the seismic hazard and risk from this increased seismicity. One of the major components in determining seismic hazard and risk is the expected level of ground shaking at a site. The level of ground shaking from a given earthquake is typically estimated using previously collected ground motion data in a region. However, in CEUS due…

Identifying Gender From Facial Features

Artificial Intelligence & ML, Image Processing, Machine Learning
Previous research has shown that our brain has specialized nerve cells responding to specific local features of a scene, such as lines, edges, angles, or movement. Our visual cortex combines these scattered pieces of information into useful patterns. Automatic face recognition aims to extract these meaningful pieces of information and put them together into a useful representation in order to perform a classification or identification task on them. When we attempt to identify gender from facial features, we are often curious about which features of the face are most important in determining gender. Are localized features such as the eyes, nose, and ears more important, or are overall features such as head shape, hairline, and face contour? There is a plethora of successful and robust face…

Analyzing Positional Play in Chess using Machine Learning

Artificial Intelligence & ML, Machine Learning
Chess has two broad approaches to game-play: tactical and positional. Tactical play is the approach of calculating maneuvers and employing tactics that take advantage of short-term opportunities, while positional play is dominated by long-term maneuvers for advantage and requires judgement more than calculation. Current-generation chess engines predominantly employ tactical play and thus outplay top human players, given their much superior computational abilities. Engines do so by searching game trees of depths typically between 20 and 30 moves and calculating a large number of variations. However, human play is often a combination of both tactical and positional approaches, since humans have some intuition about which board positions are intrinsically better than others. In our project, we use machine learning to identify elements…

PREDICTING HOSPITAL READMISSIONS IN THE MEDICARE POPULATION

Artificial Intelligence & ML, Data mining, Machine Learning, MSC IT
Avoidable hospital readmissions cost taxpayers billions of dollars each year. The Medicare Payment Advisory Commission has estimated that almost $12 billion is spent annually by Medicare on potentially preventable readmissions within 30 days of a patient’s discharge from a hospital [1]. The Medicare program has begun to apply financial penalties to hospitals that have excessive risk-adjusted readmission rates. There is much interest in the health policy and medical communities in the ability to accurately predict which patients are at high risk of being readmitted. Not only are there strong financial reasons to avoid readmissions, but readmission to the hospital can also be a sign of poor clinical care and can indicate a worsening of a patient’s condition [2]. If doctors and nurses were aware of…

Attribution of Contested and Anonymous Ancient Greek Works

8051 Microcontroller, Artificial Intelligence & ML, Data mining
Authorship attribution has been a persistent problem in the Classical genre, as texts that reach us from antiquity are often corrupted, edited, or forged over the thousands of years since their initial production. Scholars have worked on identifying writers’ stylistic differences in an attempt to distinguish genuine texts from fakes, and to attribute an author to previously anonymous works. Increasing computing power allows the derivation of more complex features, giving us new information about each author’s linguistic signature and writing style. Our system is able to accurately predict the author of a complete anonymous work, as well as many text fragments that currently have contested authorship. We experimented with using semantic and lexical features, and explored both discriminative and generative classification algorithms.…

Object Detection for Semantic SLAM using Convolution Neural Networks

Artificial Intelligence & ML, Data mining, Image Processing, Machine Learning
Conventional SLAM (Simultaneous Localization and Mapping) systems typically provide odometry estimates and point-cloud reconstructions of an unknown environment. While these outputs can be used for tasks such as autonomous navigation, they lack any semantic information. Our project implements a modular object detection framework that can be used in conjunction with a SLAM engine to generate semantic scene reconstructions. A semantically-augmented reconstruction has many potential applications. Some examples include: • Discriminating between pedestrians, cars, bicyclists, etc. in an autonomous driving system. • Loop-closure detection based on object-level descriptors. • Smart household bots that can retrieve objects given a natural language command. An object detection algorithm designed for these applications has a unique set of requirements and constraints. The algorithm needs to be…

Sentiment as a Predictor of Wikipedia Editor Activity

Artificial Intelligence & ML, Machine Learning
Wikipedia, the world’s largest encyclopedia, is created by millions of unpaid editors online. Every user can edit every article, and the project is protected against vandalism and low-quality contributions only through version control and a system of (again unpaid) reviewers. Somewhat hidden to most casual readers of the encyclopedia, Wikipedia also features a simple social network: every user has a personal user profile and a “user talk page”, which acts as a publicly accessible guestbook where users can leave messages for each other. The messages exchanged in user talk pages are often related to a user’s editing behavior. For example, senior users may welcome new users, or congratulate them on their first edits. Administrators may officially warn culprits after transgressions of Wikipedia’s…

Application of Neural Network In Handwriting Recognition

Artificial Intelligence & ML, Image Processing, Machine Learning
Handwriting recognition can be divided into two categories, namely on-line and off-line handwriting recognition. On-line recognition involves live transformation of characters written by a user on a tablet or a smartphone. In contrast, off-line recognition is more challenging, as it requires automatic conversion of scanned images or photos into a computer-readable text format. Motivated by interesting applications of off-line recognition technology, for instance the USPS address recognition system and the Chase QuickDeposit system, this project will mainly focus on discovering algorithms that allow an accurate, fast, and efficient character recognition process. The report will cover data acquisition, image processing, feature extraction, model training, results analysis, and future work. (Figure: Image Processing Flow)

Re-clustering of Constellations through Machine Learning

Artificial Intelligence & ML, Machine Learning
For thousands of years, people around the world have looked up at the sky, trying to find patterns in the distribution of visible stars and dividing them into groups called constellations. Originally, constellations were recognized and organized by people’s imagination based on the shapes of the star distribution. The two most famous groups of stars are the “Big Dipper” and “Orion”. In modern astronomy, the International Astronomical Union (IAU) has defined constellations as specific areas of the celestial sphere. These areas have their origins in star patterns from which the constellations take their names. In total, there are 88 officially recognized constellations. On the other hand, certain stars are grouped together primarily because they are close to each other and far away…

Collaborative Filtering Recommender Systems

Artificial Intelligence & ML, Data mining, Machine Learning
Collaborative filtering (CF) predicts user preferences in item selection based on the known user ratings of items. As one of the most common approaches to recommender systems, CF has proven effective for solving the information overload problem. CF can be divided into two main branches: memory-based and model-based. Most current research improves the accuracy of memory-based algorithms only by improving the similarity measures. Few studies, however, have focused on the prediction score models, which we believe are more important than the similarity measures. The best-known model-based algorithm is matrix factorization. Compared to memory-based algorithms, matrix factorization generally has higher accuracy. However, matrix factorization may fall into a local optimum in the learning process, which leads to inadequate…

Blowing up the Twittersphere: Predicting the Optimal Time to Tweet

Artificial Intelligence & ML, Data mining, Machine Learning
We can separate our problem into a few different steps. First, we need to model information about a tweet and how successful a given tweet is. Second, given a tweet, user, and post time, we must predict how successful that tweet will be. Finally, we need to use our predictor to determine the optimal time for a given user to post a specific tweet, i.e., the time that maximizes our success prediction for a specific user and tweet. We considered two papers that address similar problems of using machine learning to understand interactions in social media and predict the success of online content. Lakkaraju, McAuley, and Leskovec consider the connections between title, content, and community in social media. From their work,…
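
The final step, choosing the post time that maximizes the predicted success, can be sketched as a simple search over candidate times. Everything named here is an assumption for illustration: `best_post_hour`, the hourly candidate grid, and the stand-in `toy_predictor`, which takes the place of the project's learned model.

```python
def best_post_hour(predict_success, tweet, user, candidate_hours=range(24)):
    """Return the candidate hour with the highest predicted success for this tweet and user."""
    return max(candidate_hours, key=lambda hour: predict_success(tweet, user, hour))


def toy_predictor(tweet, user, hour):
    """Hypothetical stand-in for a trained model: the score peaks at 18:00."""
    return -abs(hour - 18)


best = best_post_hour(toy_predictor, "hello world", "alice")
```

With a real predictor, the same search would run over whatever time resolution the model supports.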

Recognition and Classification of Fast Food Images

Artificial Intelligence & ML, Data mining, Machine Learning
Food recognition is of great importance nowadays for multiple purposes. On one hand, people who want to better understand food that they are unfamiliar with or have never seen before can simply take a picture and learn more details about it. On the other hand, the increasing demand for dietary assessment tools to record calories and nutrition has also been a driving force in the development of food recognition techniques. Therefore, automatic food recognition is very important and has great application potential. However, food varies greatly in appearance (e.g., shape, colors), with many different ingredients and assembling methods. This makes food recognition a difficult task for current state-of-the-art classification methods, and…

Predicting Heart Attacks

Artificial Intelligence & ML, Data mining, Machine Learning
In the field of medical science, there is a huge amount of data. Data mining techniques are being used to discover hidden patterns in these data. Advanced data mining techniques have been developed in recent years. The efficiency of these techniques is compared using sensitivity, specificity, accuracy, and error rate. Some well-known data mining classification techniques are Decision Trees, Artificial Neural Networks, Support Vector Machines, and the Naïve Bayes classifier. In this paper, we introduce a new method based on the fitness value of the attributes to predict heart disease. We use 10 attributes for our proposed method and use simple calculations. In everyday life, there are several examples where we have to analyze historical data; for example, a bank loan officer needs analysis…

E-Commerce Sales Prediction Using Listing Keywords

Artificial Intelligence & ML, Data mining, Machine Learning
Small online retailers usually set themselves apart from brick-and-mortar stores, traditional brand names, and giant online retailers by offering goods at an exceptional value. In addition to price, they compete for shoppers’ attention via descriptive listing titles, whose effectiveness as search keywords can help drive sales. In this study, machine learning techniques will be applied to online retail data to measure the link between keywords and sales volumes. (Figure: Architecture)

Prediction and Classification of Cardiac Arrhythmia

Artificial Intelligence & ML, Data mining, Machine Learning
Irregularity in heartbeat may be harmless or life-threatening. Hence, both accurate detection of the presence of arrhythmia and its classification are important. Arrhythmia can be diagnosed by measuring the heart activity using an instrument called an ECG, or electrocardiograph, and then analyzing the recorded data. Different parameter values can be extracted from the ECG waveforms and used along with other information about the patient, such as age and medical history, to detect arrhythmia. However, sometimes it may be difficult for a doctor to look at these long-duration ECG recordings and find minute irregularities. Therefore, using machine learning to automate arrhythmia diagnosis can be very helpful. The project aims at using different machine learning algorithms like Naive Bayes, SVM, Random Forests and Neural Networks…

Sentiment Analysis for Hotel Reviews

Artificial Intelligence & ML, Data mining, Machine Learning
Travel planning and hotel booking have become important commercial uses of the web. Sharing on the web has become a major tool for expressing customer thoughts about a particular product or service. Recent years have seen rapid growth in online discussion groups and review sites (e.g., www.tripadvisor.com), where a crucial characteristic of a customer’s review is its sentiment or overall opinion. For example, a review containing words like ‘great’, ‘best’, ‘nice’, ‘good’, or ‘awesome’ is probably a positive comment, whereas a review containing words like ‘bad’, ‘poor’, ‘awful’, or ‘worse’ is probably a negative one. However, Trip Advisor’s star rating does not express the exact experience of the customer. Most of the ratings are meaningless; a large chunk of reviews fall in the…

Mood Detection with Tweets

Artificial Intelligence & ML, Data mining, Machine Learning
Emotional states of individuals, also known as moods, are central to the expression of thoughts, ideas, and opinions, and in turn impact attitudes and behavior. Social media tools like Twitter are increasingly used by individuals to broadcast their day-to-day happenings or to report on an external event of interest. Understanding the rich “landscape” of moods will help us better interpret millions of individuals. This paper describes a rule-based approach, which detects the emotion or mood of a tweet and classifies the Twitter message under the appropriate emotional category. The accuracy of the system is 85%. With the proposed system, it is possible to understand deeper, finer-grained levels of emotion rather than coarse-grained sentiment. The sentiment says whether the tweet is positive…

3D Scene Retrieval from Text with Semantic Parsing

Artificial Intelligence & ML, Machine Learning
We look at the task of 3D scene retrieval: given a natural-language description and a set of 3D scenes, identify a scene matching the description. Geometric specifications of 3D scenes are part of the craft of many graphical computing applications, including computer animation, games, and simulators. Large databases of such scenes have become available in recent years as a result of improvements in the ease of use of tools for 3D scene design. A system that can identify a 3D scene from a natural language description is useful for making such databases of scenes readily accessible. Natural language has evolved to be well-suited to describing our (three-dimensional) world, and it provides a convenient way of specifying the space of acceptable scenes: a…

Parking Occupancy Prediction and Pattern Analysis

Artificial Intelligence & ML, Data mining, Machine Learning
According to the Department of Parking and Traffic, San Francisco has more cars per square mile than any other city in the US [1]. The search for an empty parking spot can become an agonizing experience for the city’s urban drivers. A recent article claims that drivers cruising for a parking spot in SF generate 30% of all downtown congestion [2]. These wasted miles not only increase traffic congestion, but also lead to more pollution and driver anxiety. In order to alleviate this problem, the city equipped 7,000 metered parking spaces and 12,250 garage spots (a total of 593 parking lots) with sensors and introduced a mobile application called SFpark [3], which provides drivers with real-time information about the availability of a parking lot. However,…

Stock Trend Prediction with Technical Indicators using SVM

Artificial Intelligence & ML, Machine Learning
Short-term prediction of stock price trends has potential applications for personal investment without high-frequency-trading infrastructure. Unlike a market index (as explored by previous years’ projects), a single stock price tends to be affected by large noise, and its long-term trend inherently converges to the company’s market performance. So this project focuses on short-term (1-10 days) prediction of the stock price trend and takes the approach of analyzing time series indicators as features to classify the trend (up or down). The validation model is chosen so that the testing set always follows the training set in time, to simulate real prediction. A cross-validated grid search over the parameters of an RBF-kernel SVM is performed when fitting the training data, to balance bias and variance. Although…
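
The requirement that the testing set always follows the training set in time can be illustrated with a walk-forward split. This is a sketch under stated assumptions: the expanding-window scheme, the function name `walk_forward_splits`, and its parameters are hypothetical, and the project's exact validation scheme is not given in this excerpt.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, test_indices) pairs for time-ordered samples.

    Each training window starts at index 0 and grows with every fold;
    each test window is the block of samples immediately after it.
    """
    fold_size = (n_samples - min_train) // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        test_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))


# Ten time-ordered samples, three folds, at least four training samples.
splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
```

A grid search over SVM parameters could then score each parameter setting on these splits, so that no fold is ever evaluated on data that precedes its training window.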

Predicting Usefulness of Yelp Reviews

Artificial Intelligence & ML, Data mining, Machine Learning
The Yelp Dataset Challenge makes a huge set of user, business, and review data publicly available for machine learning projects. Yelp wishes to find interesting trends and patterns in all of the data it has accumulated. Our goal is to predict how useful a review will prove to be to users. We can use review upvotes as a metric. This could have immediate applications: many people rely on Yelp to make consumer choices, so predicting the most helpful reviews to display on a page before they have actually been rated would have a serious impact on user experience.

Facial Keypoints Detection

Artificial Intelligence & ML, Image Processing
Facial keypoint detection has become a very popular topic, and its applications, including Snapchat and How old are you, have attracted a large number of users. The objective of facial keypoint detection is to find the facial keypoints in a given face, which is very challenging because facial features differ greatly from person to person. Deep learning approaches, such as neural networks and cascaded neural networks, have been applied to this problem, and the results of these architectures are significantly better than those of state-of-the-art methods such as feature extraction and dimension reduction algorithms. In our project, we would like to locate the keypoints in a given image using deep architectures to not only obtain lower loss for the detection task but also…

Multiclass Classifier Building with Amazon Data to Classify Customer Reviews into Product Categories

Artificial Intelligence & ML, Data mining, Machine Learning
E-commerce (electronic commerce) is defined as the buying and selling of products over electronic systems such as the Internet. With the widespread use of the Internet, trade conducted electronically (online) has grown extraordinarily. E-commerce companies have large databases of products and a large number of consumers who use these data. To address this data and information explosion, e-commerce stores are applying machine learning to identify and customize product category information. Data scientists in this field are utilizing the potential of machine learning to build unmatched competitiveness in the market by finding purchase preferences, customer churn, product suggestions, etc. Applying popular Machine Learning algorithms to huge datasets brought new challenges for the ML…

An Energy Efficient Seizure Prediction Algorithm

Artificial Intelligence & ML, Machine Learning
Epileptic seizures afflict over 1% of the world’s population. If seizures could be predicted before they occur, fast-acting therapies could be delivered to prevent the attack and restore a normal quality of life to patients. Over the last two decades, several studies have explored the use of EEG signals to predict seizures using principles from machine learning [1]–[3]. It is thought that such an algorithm could be implemented in real time with a wireless, implanted EEG sensor. However, there are two main constraints for such a portable system. First, due to limited battery life, energy consumption must be minimal. Second, due to limited bandwidth, the data transmitted between the sensor and the central processing device (such as a mobile phone, tablet, or personal computer) should be…
Read More

Classifier Comparisons On Credit Approval Prediction

Artificial Intelligence & ML, Machine Learning, MSC IT
Classifier Comparisons On Credit Approval Prediction The objective of this work is to investigate the performance of different classification algorithms using WEKA tool for credit card approval. A major problem in financial analysis is to build an ultimate model that yields fruitful results on certain given information. Neither a single data mining model fulfills all business requirements nor does a business need depend on a single model. Different models must be evaluated to attain the ultimate model. This kind of difficulty could be resolved with the aid of machine learning which could be used directly to obtain the end result with the aid of several artificial intelligent algorithms which perform the role of classifiers. Classification algorithms always find a rule or set of rules to represent data in classes [1].…
Read More

Automatic Number Plate Recognition System

Artificial Intelligence & ML, Machine Learning
Automatic Number Plate Recognition System Automatic number plate recognition (ANPR) is a mass surveillance method that uses optical character recognition on images to read the license plates on vehicles. ANPR systems can use existing closed-circuit television or road-rule enforcement cameras, or ones specifically designed for the task. They are used by various police forces and as a method of electronic toll collection on pay-per-use roads and monitoring traffic activity, such as red light adherence in an intersection. ANPR can be used to store the images captured by the cameras as well as the text from the license plate, with some configurable to store a photograph of the driver. Systems commonly use infrared lighting to allow the camera to take the picture at any time of the day. A powerful flash…
Read More

Practical Approximate k Nearest Neighbor Queries with Location and Query Privacy

Artificial Intelligence & ML, Data mining, Machine Learning
Practical Approximate k Nearest Neighbor Queries with Location and Query Privacy In mobile communication, spatial queries pose a serious threat to user location privacy because the location of a query may reveal sensitive information about the mobile user. In this paper, we study approximate k nearest neighbor (KNN) queries where the mobile user queries the location-based service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current location. We propose a basic solution and a generic solution for the mobile user to preserve his location and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier public-key cryptosystem and can provide both location and query privacy. To preserve query privacy, our basic solution allows the mobile user to retrieve…
Read More

QUANTIFYING POLITICAL LEANING FROM TWEETS, RETWEETS, AND RETWEETERS

Artificial Intelligence & ML, Data mining, Machine Learning, MSC IT
QUANTIFYING POLITICAL LEANING FROM TWEETS, RETWEETS, AND RETWEETERS In recent years, big online social media data have found many applications in the intersection of political and computer science. Examples include answering questions in political and social science (e.g., proving/disproving the existence of media bias [3, 30] and the “echo chamber” effect [1, 5]), using online social media to predict election outcomes [46, 31], and personalizing social media feeds so as to provide a fair and balanced view of people’s opinions on controversial issues [36]. A prerequisite for answering the above research questions is the ability to accurately estimate the political leaning of the population involved. If it is not met, either the conclusion will be invalid, the prediction will perform poorly [35, 37] due to a skew towards highly vocal…
Read More

Efficient Algorithms for Mining Top-K High Utility Itemsets

Artificial Intelligence & ML, Data mining, Machine Learning
Efficient Algorithms for Mining Top-K High Utility Itemsets In recent years, online shopping has become more and more popular. When deciding whether to purchase a product online, the opinions of others become important. This presents a great opportunity to share our viewpoints on various product purchases. However, people face the information overload problem. How to mine valuable information from reviews to understand a user’s preferences and make an accurate recommendation is crucial. Traditional recommender systems consider factors such as a user’s purchase records, product category, and geographic location. This work proposes a sentiment-based rating prediction method to improve prediction accuracy in recommender systems. Firstly, it proposes a social user sentimental measurement approach and calculates each user’s sentiment on items. Secondly, it not only…
Read More

Efficient Algorithms for Mining Top-K High Utility Itemsets

Artificial Intelligence & ML, Data mining, Machine Learning
Efficient Algorithms for Mining Top-K High Utility Itemsets Mining high utility itemsets from databases is an emerging topic in data mining, which refers to the discovery of itemsets with utilities higher than a user-specified minimum utility threshold min_util. Although several studies have been carried out on this topic, setting an appropriate minimum utility threshold is a difficult problem for users. If min_util is set too low, too many high utility itemsets will be generated, which may cause the mining algorithms to become inefficient or even run out of memory. On the other hand, if min_util is set too high, no high utility itemset will be found. Setting appropriate minimum utility thresholds by trial and error is a tedious process for users. In this paper, we address this problem by proposing…
Read More
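To make the notions of itemset utility and top-K selection concrete, here is a minimal brute-force sketch in Python. It is not the algorithm proposed in the paper: it simply enumerates every itemset, which is only feasible for tiny inputs. The function names, the transaction layout (item mapped to purchased quantity), and the `profit` table are all assumptions made for illustration.

```python
from itertools import combinations
import heapq


def itemset_utility(itemset, transactions, profit):
    """Sum quantity * unit profit over every transaction containing the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total


def top_k_high_utility(transactions, profit, k):
    """Return the k itemsets with the highest utility as (utility, itemset) pairs.

    Exhaustive enumeration: exponential in the number of distinct items.
    """
    items = sorted({item for t in transactions for item in t})
    scored = []
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, profit)
            if utility > 0:
                scored.append((utility, itemset))
    return heapq.nlargest(k, scored)


# Hypothetical data: each transaction maps an item to the quantity bought.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
profit = {"a": 5, "b": 2, "c": 1}  # assumed unit profits
print(top_k_high_utility(transactions, profit, k=3))
```

Asking for the K best itemsets directly means the user never has to guess a min_util value; the K-th best utility found plays the role of the threshold.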

Crowdsourcing for Top-K Query Processing over Uncertain Data

Artificial Intelligence & ML, Data mining
Crowdsourcing for Top-K Query Processing over Uncertain Data Querying uncertain data has become a prominent application due to the proliferation of user-generated content from social media and of data streams from sensors. When data ambiguity cannot be reduced algorithmically, crowdsourcing proves a viable approach, which consists in posting tasks to humans and harnessing their judgment for improving the confidence about data values or relationships. This paper tackles the problem of processing top-K queries over uncertain data with the help of crowdsourcing, so as to converge quickly to the real ordering of relevant results. Several offline and online approaches for addressing questions to a crowd are defined and contrasted on both synthetic and real datasets, with the aim of minimizing the crowd interactions necessary to find the real ordering of the result set.…
Read More

Cyberbullying Detection based on Semantic-Enhanced Marginalized Denoising Auto-Encoder

Artificial Intelligence & ML, Data mining
Cyberbullying Detection based on Semantic-Enhanced Marginalized Denoising Auto-Encoder As a side effect of increasingly popular social media, cyberbullying has emerged as a serious problem afflicting children, adolescents, and young adults. Machine learning techniques make automatic detection of bullying messages in social media possible, and this could help to construct a healthy and safe social media environment. In this meaningful research area, one critical issue is robust and discriminative numerical representation learning of text messages. In this paper, we propose a new representation learning method to tackle this problem. Our method named semantic-enhanced marginalized denoising auto-encoder (smSDA) is developed via a semantic extension of the popular deep learning model stacked denoising autoencoder (SDA). The semantic extension consists of semantic dropout noise and sparsity constraints, where the semantic dropout noise is designed…
Read More

Mining Facets For Queries From Their Search Results

Artificial Intelligence & ML, Data mining, Machine Learning
Mining Facets For Queries From Their Search Results A query facet is a set of items which describe and summarize one important aspect of a query. Here a facet item is typically a word or a phrase. A query may have multiple facets that summarize the information about the query from different perspectives. For the query “watches”, its query facets cover the knowledge about watches in five unique aspects, including brands, gender categories, supporting features, styles, and colors. The query “visit Beijing” has a facet about popular resorts in Beijing (Tiananmen square, forbidden city, summer palace, ...) and a facet on several travel-related topics (attractions, shopping, dining, ...). Query facets provide interesting and useful knowledge about a query and thus can be used to improve search experiences in many ways.…
Read More

Detecting Malicious Facebook Applications

Artificial Intelligence & ML, Machine Learning
Detecting Malicious Facebook Applications With 20 million installs a day, third-party apps are a major reason for the popularity and addictiveness of Facebook. Unfortunately, hackers have realized the potential of using apps for spreading malware and spam. The problem is already significant, as we find that at least 13% of apps in our dataset are malicious. So far, the research community has focused on detecting malicious posts and campaigns. In this paper, we ask the question: Given a Facebook application, can we determine if it is malicious? Our key contribution is in developing FRAppE (Facebook's Rigorous Application Evaluator), arguably the first tool focused on detecting malicious apps on Facebook. To develop FRAppE, we use information gathered by observing the posting behavior of 111K Facebook apps seen across 2.2 million users on Facebook.…
Read More

Sentiment Analysis of Top Colleges in India Using Twitter Data

Artificial Intelligence & ML, Data mining, Machine Learning
Sentiment Analysis of Top Colleges in India Using Twitter Data Social media has captured the attention of the entire world because it sends thoughts across the globe extremely quickly, is user-friendly, and is free of cost, requiring only a working internet connection. People use this platform extensively to share their thoughts loud and clear. Twitter is one such well-known micro-blogging site, receiving around 500 million tweets per day. Each user has a daily limit of 2,400 tweets and 140 characters per tweet. Twitter users post (or ‘tweet’) every day about various subjects like products, services, day-to-day activities, places, personalities, etc. Hence, Twitter data is highly relevant, as it can be used in various scenarios where companies or brands can utilize a direct connection to almost each…
Read More

FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce

Artificial Intelligence & ML, Data mining, Hadoop
FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce Data mining is the process of discovering patterns in huge amounts of data. There are many data mining techniques, such as clustering, classification, and association rule mining. The most popular one is association rule mining, which is divided into two parts: (i) generating the frequent itemsets and (ii) generating association rules from all itemsets. Frequent itemset mining (FIM) is the core problem in association rule mining. Sequential FIM algorithms suffer from performance deterioration when operated on a huge amount of data on a single machine. To address this problem, parallel FIM algorithms were proposed. There are two types of algorithms that can be used for mining frequent itemsets: the candidate-itemset generation approach and the approach without candidate itemset generation.…
Read More
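The candidate-itemset generation approach mentioned above can be sketched in a few lines of Python. This is a sequential, single-machine illustration in the style of Apriori, not FiDoop's parallel MapReduce algorithm. The function name `frequent_itemsets` and the use of an absolute support count for `min_support` are assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise frequent itemset mining with candidate generation.

    min_support is an absolute count of transactions (an assumption of this sketch).
    Returns a dict mapping each frequent itemset (a frozenset) to its support count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        previous = list(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count step: one pass over the data per level.
        current = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= min_support:
                current[c] = support
        result.update(current)
        k += 1
    return result


baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
for itemset, support in frequent_itemsets(baskets, min_support=2).items():
    print(sorted(itemset), support)
```

The counting step rescans every transaction at each level, which is the cost that grows painfully on a single machine and motivates parallel FIM algorithms.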

Workflow-Based Big Data Analytics in The Cloud Environment

Artificial Intelligence & ML, Machine Learning, MSC IT
Workflow-Based Big Data Analytics in The Cloud Environment Since digital data repositories are more and more massive and distributed, we need smart data analysis techniques and scalable architectures to extract useful information from them in reduced time. Cloud computing infrastructures offer an effective support for addressing both the computational and data storage needs of big data mining applications. In fact, complex data mining tasks involve data- and compute-intensive algorithms that require large and efficient storage facilities together with high-performance processors to get results in acceptable times. In this chapter, we present a Data Mining Cloud Framework designed for developing and executing distributed data analytics applications as workflows of services. In this environment, we use datasets, analysis tools, data mining algorithms and knowledge models that are implemented as single services that can be combined through a visual programming…
Read More

Modeling Urban Behavior by Mining Geotagged Social Data

Artificial Intelligence & ML, Machine Learning
Modeling Urban Behavior by Mining Geotagged Social Data Data generated on location-based social networks provide rich information on the whereabouts of urban dwellers. Specifically, such data reveal who spends time where, when, and on what type of activity (e.g., shopping at a mall, or dining at a restaurant). That information can, in turn, be used to describe city regions in terms of activity that takes place therein. For example, the data might reveal that citizens visit one region mainly for shopping in the morning, while another for dining in the evening. Furthermore, once such a description is available, one can ask more elaborate questions. For example, one might ask what features distinguish one region from another - some regions might be different in terms of the type of venues they…
Read More

ClubCF: A Clustering-based Collaborative Filtering Approach for Big Data Application

Hadoop
"ClubCF: A Clustering-based Collaborative Filtering Approach for Big Data Application Spurred by service computing and cloud computing, an increasing number of services are emerging on the Internet. As a result, service-relevant data become too big to be effectively processed by traditional approaches. In view of this challenge, a Clustering-based Collaborative Filtering approach (ClubCF) is proposed in this paper, which aims at recruiting similar services in the same clusters to recommend services collaboratively. Technically, this approach is enacted around two stages. In the first stage, the available services are divided into small-scale clusters, in logic, for further processing. At the second stage, a collaborative filtering algorithm is imposed on one of the clusters. Since the number of the services in a cluster is much less than the total number of the…
Read More

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Hadoop
"Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework The buzz-word big-data refers to the large-scale distributed data processing applications that operate on exceptionally large amounts of data. Google’s MapReduce and Apache’s Hadoop, its open-source implementation, are the defacto software systems for big-data applications. An observation of the MapReduce framework is that the framework generates a large amount of intermediate data. Such abundant information is thrown away after the tasks finish, because MapReduce is unable to utilize them. In this paper, we propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager. A task queries the cache manager before executing the actual computing work. A novel cache description scheme and a cache request and reply protocol are…
Read More
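The query-before-compute pattern described in the Dache abstract can be illustrated with a small in-process sketch in Python. It assumes a cache keyed by an input identifier plus an operation name. The class `CacheManager`, the function `run_map_task`, and that key layout are hypothetical and do not reproduce Dache's actual cache description scheme or its request and reply protocol.

```python
class CacheManager:
    """Toy in-memory cache manager keyed by (data_id, operation)."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def lookup(self, data_id, operation):
        """Return (found, result) for a previously submitted intermediate result."""
        key = (data_id, operation)
        if key in self._store:
            self.hits += 1
            return True, self._store[key]
        return False, None

    def submit(self, data_id, operation, result):
        """Store an intermediate result so that later tasks can reuse it."""
        self._store[(data_id, operation)] = result


def run_map_task(cache, data_id, records, operation, func):
    """Query the cache first; compute and submit only on a miss."""
    found, cached = cache.lookup(data_id, operation)
    if found:
        return cached
    result = [func(record) for record in records]
    cache.submit(data_id, operation, result)
    return result


cache = CacheManager()
split = ["big data cache", "map reduce"]  # hypothetical input split


def count_words(line):
    return len(line.split())


first = run_map_task(cache, "split-0", split, "word_count", count_words)
second = run_map_task(cache, "split-0", split, "word_count", count_words)
print(first, second, cache.hits)
```

The second call returns the stored result without recomputing it, which is the saving a data-aware cache aims for when intermediate data would otherwise be discarded.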

Cost Minimization for Big Data Processing in Geo-Distributed Data Centers

Hadoop
"Cost Minimization for Big Data Processing in Geo-Distributed Data Centers The explosive growth of demands on big data processing imposes a heavy burden on computation, storage, and communication in data centers, which hence incurs considerable operational expenditure to data center providers. Therefore, cost minimization has become an emergent issue for the upcoming big data era. Different from conventional cloud services, one of the main features of big data services is the tight coupling between data and computation as computation tasks can be conducted only when the corresponding data is available. As a result, three factors, i.e., task assignment, data placement and data movement, deeply influence the operational expenditure of data centers. In this paper, we are motivated to study the cost minimization problem via a joint optimization of these three…
Read More

KASR: A Keyword-Aware Service Recommendation Method on MapReduce for Big Data Applications

Hadoop
"KASR: A Keyword-Aware Service Recommendation Method on MapReduce for Big Data Applications Service recommender systems have been shown as valuable tools for providing appropriate recommendations to users. In the last decade, the amount of customers, services and online information has grown rapidly, yielding the big data analysis problem for service recommender systems. Consequently, traditional service recommender systems often suffer from scalability and inefficien-cy problems when processing or analysing such large-scale data. Moreover, most of existing service recommender systems present the same ratings and rankings of services to different users without considering diverse users' preferences, and therefore fails to meet users' personalized requirements. In this paper, we propose a Keyword-Aware Service Recommendation method, named KASR, to address the above challenges. It aims at presenting a personalized service recommendation list and recommending…
Read More

Authorized Public Auditing of Dynamic Big Data Storage on Cloud with Efficient Verifiable Fine-grained Updates

Hadoop
"Authorized Public Auditing of Dynamic Big Data Storage on Cloud with Efficient Verifiable Fine-grained Updates Cloud computing opens a new era in IT as it can provide various elastic and scalable IT services in a pay-as-you-go fashion, where its users can reduce the huge capital investments in their own IT infrastructure. In this philosophy, users of cloud storage services no longer physically maintain direct control over their data, which makes data security one of the major concerns of using cloud. Existing research work already allows data integrity to be verified without possession of the actual data file. When the verification is done by a trusted third party, this verification process is also called data auditing, and this third party is called an auditor. However, such schemes in existence suffer from…
Read More

Privacy Preserving Data Analytics for Smart Homes

Hadoop
"Privacy Preserving Data Analytics for Smart Homes A framework for maintaining security & preserving privacy for analysis of sensor data from smart homes, without compromising on data utility is presented. Storing the personally identifiable data as hashed values withholds identifiable information from any computing nodes. However the very nature of smart home data analytics is establishing preventive care. Data processing results should be identifiable to certain users responsible for direct care. Through a separate encrypted identifier dictionary with hashed and actual values of all unique sets of identifiers, we suggest re-identification of any data processing results. However the level of re-identification needs to be controlled, depending on the type of user accessing the results. Generalization and suppression on identifiers from the identifier dictionary before re-introduction could achieve different levels of…
Read More
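The idea of storing hashed identifiers alongside a separate identifier dictionary, with role-dependent re-identification, might look roughly like the Python sketch below. It uses a salted SHA-256 hash from the standard library. The role names, the masking rule used for generalization, and the plain in-memory dictionary are all assumptions; the framework described above keeps the dictionary encrypted, which this sketch omits.

```python
import hashlib


def pseudonymize(identifier, salt):
    """Replace an identifier with a salted SHA-256 hex digest."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()


def build_dictionary(identifiers, salt):
    """Map hashed values back to actual values (kept separate from the data)."""
    return {pseudonymize(i, salt): i for i in identifiers}


def reidentify(hashed, dictionary, role):
    """Return the identifier at a level of detail that depends on the caller's role.

    Role names are hypothetical: 'direct_care' sees the actual value,
    'analyst' sees a generalized (masked) value, everyone else sees nothing.
    """
    actual = dictionary.get(hashed)
    if actual is None:
        return None
    if role == "direct_care":
        return actual
    if role == "analyst":
        return actual[:1] + "*" * (len(actual) - 1)  # generalization by masking
    return "<suppressed>"  # suppression


salt = "demo-salt"  # assumed; a real deployment would manage this secret carefully
dictionary = build_dictionary(["resident-17"], salt)
token = pseudonymize("resident-17", salt)
for role in ("direct_care", "analyst", "guest"):
    print(role, reidentify(token, dictionary, role))
```

Computing nodes would only ever see `token`; the dictionary lookup happens afterwards, when results are handed to a user.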

MRPrePost-A parallel algorithm adapted for mining big data

Hadoop
"MRPrePost-A parallel algorithm adapted for mining big data With the explosive growth in data, using data mining techniques to mine association rules, and then to find valuable information hidden in big data has become increasingly important. Various existing data mmmg techniques often through mining frequent itemsets to derive association rules and access to relevant knowledge, but with the rapid arrival of the era of big data, Traditional data mining algorithms have been unable to meet large data's analysis needs. In view of this, this paper proposes an adaptation to the big data mining parallel algorithms-MRPrePost. MRPrePost is a parallel algorithm based on Hadoop platform, which improves PrePost by way of adding a prefix pattern, and on this basis into the parallel design ideas, making MRPrePost algorithm can adapt to mining…
Read More

Enabling Efficient Access Control with Dynamic Policy Updating for Big Data in the Cloud

Hadoop
"Enabling Efficient Access Control with Dynamic Policy Updating for Big Data in the Cloud Due to the high volume and velocity of big data, it is an effective option to store big data in the cloud, because the cloud has capabilities of storing big data and processing high volume of user access requests. Attribute-Based Encryption (ABE) is a promising technique to ensure the end-to-end security of big data in the cloud. However, the policy updating has always been a challenging issue when ABE is used to construct access control schemes. A trivial implementation is to let data owners retrieve the data and re-encrypt it under the new access policy, and then send it back to the cloud. This method incurs a high communication overhead and heavy computation burden on data…
Read More

Load Balancing for Privacy-Preserving Access to Big Data in Cloud

Hadoop
"Load Balancing for Privacy-Preserving Access to Big Data in Cloud In the era of big data, many users and companies start to move their data to cloud storage to simplify data management and reduce data maintenance cost. However, security and privacy issues become major concerns because third-party cloud service providers are not always trusty. Although data contents can be protected by encryption, the access patterns that contain important information are still exposed to clouds or malicious attackers. In this paper, we apply the ORAM algorithm to enable privacy-preserving access to big data that are deployed in distributed file systems built upon hundreds or thousands of servers in a single or multiple geo-distribu ted cloud sites. Since the ORAM algorithm would lead to serious access load unbalance among storage servers, we…
Read More

Secure Sensitive Data Sharing on a Big Data Platform

Hadoop
"Secure Sensitive Data Sharing on a Big Data Platform Users store vast amounts of sensitive data on a big data platform. Sharing sensitive data will help enterprises reduce the cost of providing users with personalized services and provide value-added data services. However, secure data sharing is problematic. This paper proposes a framework for secure sensitive data sharing on a big data platform, including secure data delivery, storage, usage, and destruction on a semi-trusted big data sharing platform. We present a proxy re-encryption algorithm based on heterogeneous ciphertext transformation and a user process protection method based on a virtual machine monitor, which provides support for the realization of system functions. The framework protects the security of users’ sensitive data effectively and shares these data safely. At the same time, data owners…
Read More

Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing

Hadoop
"Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing The unprecedented growth in mobile device adoption and the rapid advancement of mobile technologies & wireless networks have created new opportunities in mobile marketing and adverting. The opportunities for Mobile Marketers and Advertisers include real-time customer engagement, improve customer experience, build brand loyalty, increase revenues, and drive customer satisfaction. The challenges, however, for the Marketers and Advertisers include how to analyze troves of data that mobile devices emit and how to derive customer engagement insights from the mobile data. This research paper addresses the challenge by developing Big Data Mobile Marketing analytics and advertising recommendation framework. The proposed framework supports both offline and online advertising operations in which the selected analytics techniques are used to provide advertising recommendations…
Read More

PaWI: Parallel Weighted Itemset Mining by means of MapReduce

Hadoop
"PaWI: ParallelWeighted Itemset Mining by means of MapReduce Frequent itemset mining is an exploratory data mining technique that has fruitfully been exploited to extract recurrent co-occurrences between data items. Since in many application contexts items are enriched with weights denoting their relative importance in the analyzed data, pushing item weights into the itemset mining process, i.e., mining weighted itemsets rather than traditional itemsets, is an appealing research direction. Although many efficient in-memory weighted itemset mining algorithms are available in literature, there is a lack of parallel and distributed solutions which are able to scale towards Big Weighted Data. This paper presents a scalable frequent weighted itemset mining algorithm based on the MapReduce paradigm. To demonstrate its actionability and scalability, the proposed algorithm was tested on a real Big dataset collecting…
Read More

Performance Analysis of Scheduling Algorithms for Dynamic Workflow Applications

Hadoop
"Performance Analysis of Scheduling Algorithms for Dynamic Workflow Applications In recent years, Big Data has changed how we do computing. Even though we have large scale infrastructure such as Cloud computing and several platforms such as Hadoop available to process the workloads, with Big Data there is a high level of uncertainty that has been introduced in how an application processes the data. Data in general comes in different formats, at different speed and at different volume. Processing consists of not just one application but several applications combined to form a workflow to achieve a certain goal. With data variation and at different speed, applications execution and resource needs will also vary at runtime. These are called dynamic workflows. One can say that we can just throw more and more…
Read More

BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value Store

Hadoop
"BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value Store Nowadays, cloud-based storage services are rapidly growing and becoming an emerging trend in data storage field. There are many problems when designing an efficient storage engine for cloud-based systems with some requirements such as big-file processing, lightweight meta-data, low latency, parallel I/O, deduplication, distributed, high scalability. Key-value stores played an important role and showed many advantages when solving those problems. This paper presents about Big File Cloud (BFC) with its algorithms and architecture to handle most of problems in a big-file cloud storage system based on keyvalue store. It is done by proposing low-complicated, fixed-size meta-data design, which supports fast and highly-concurrent, distributed file I/O, several algorithms for resumable upload, download and simple data deduplication method for static data. This…
Read More

Privacy Preserving Data Analysis in Mental Health Research

Hadoop
"Privacy Preserving Data Analysis in Mental Health Research The digitalization of mental health records and psychotherapy notes has made individual mental health data more readily accessible to a wide range of users including patients, psychiatrists, researchers, statisticians, and data scientists. However, increased accessibility of highly sensitive mental records threatens the privacy and confidentiality of psychiatric patients. The objective of this study is to examine privacy concerns in mental health research and develop a privacy preserving data analysis approach to address these concerns. In this paper, we demonstrate the key inadequacies of the existing privacy protection approaches applicable to use of mental health records and psychotherapy notes in recordsbased research. We then develop a privacy-preserving data analysis approach that enables researchers to protect the privacy of people with mental illness once…
Read More

Recent Advances in Autonomic Provisioning of Big Data Applications on Clouds

Hadoop
"Recent Advances in Autonomic Provisioning of Big Data Applications on Clouds CLOUD computing [1] assembles large networks of virtualized ICT services such as hardware resources (such as CPU, storage, and network), software resources (such as databases, application servers, and web servers) and applications.In industry these services are referred to as infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). Mainstream ICT powerhouses such as Amazon, HP, and IBM are heavily investing in the provision and support of public cloud infrastructure. Cloud computing is rapidly becoming a popular infrastructure of choice among all types of organisations. Despite some initial security concerns and technical issues, an increasing number of organisations have moved their applications and services in to “The Cloud”. These applications range from generic…
Read More

Processing Geo-Dispersed Big Data in an Advanced MapReduce Framework

Hadoop
"Processing Geo-Dispersed Big Data in an Advanced MapReduce Framework Big data takes many forms, including messages in social networks, data collected from various sensors, captured videos, and so on. Big data applications aim to collect and analyze large amounts of data, and efficiently extract valuable information from the data. A recent report shows that the amount of data on the Internet is about 500 billion GB. With the fast increase of mobile devices that can perform sensing and access the Internet, large amounts of data are generated daily. In general, big data has three features: large volume, high velocity and large variety [1]. The International Data Corporation (IDC) predicted that the total amount of data generated in 2020 globally will be about 35 ZB. Facebook needs to process about 1.3…
Read More

Deduplication on Encrypted Big Data in Cloud

Hadoop
"Deduplication on Encrypted Big Data in Cloud Cloud computing offers a new way of service provision by re-arranging various resources over the Internet. The most important and popular cloud service is data storage. In order to preserve the privacy of data holders, data are often stored in cloud in an encrypted form. However, encrypted data introduce new challenges for cloud data deduplication, which becomes crucial for big data storage and processing in cloud. Traditional deduplication schemes cannot work on encrypted data. Existing solutions of encrypted data deduplication suffer from security weakness. They cannot flexibly support data access control and revocation. Therefore, few of them can be readily deployed in practice. In this paper, we propose a scheme to deduplicate encrypted data stored in cloud based on ownership challenge and proxy…
Read More

Big data, big knowledge: big data for personalised healthcare

Hadoop
"Big data, big knowledge: big data for personalised healthcare The idea that the purely phenomenological knowledge that we can extract by analysing large amounts of data can be useful in healthcare seems to contradict the desire of VPH researchers to build detailed mechanistic models for individual patients. But in practice no model is ever entirely phenomenological or entirely mechanistic. We propose in this position paper that big data analytics can be successfully combined with VPH technologies to produce robust and effective in silico medicine solutions. In order to do this, big data technologies must be further developed to cope with some specific requirements that emerge from this application. Such requirements are: working with sensitive data; analytics of complex and heterogeneous data spaces, including non-textual information; distributed data management under security…
Read More

A Time Efficient Approach for Detecting Errors in Big Sensor Data on Cloud

Hadoop
"A data mining framework to analyze road accident data Big sensor data is prevalent in both industry and scientific research applications where the data is generated with high volume and velocity it is difficult to process using on-hand database management tools or traditional data processing applications. Cloud computing provides a promising platform to support the addressing of this challenge as it provides a flexible stack of massive computing, storage, and software services in a scalable manner at low cost. Some techniques have been developed in recent years for processing sensor data on cloud, such as sensor-cloud. However, these techniques do not provide efficient support on fast detection and locating of errors in big sensor data sets. For fast data error detection in big sensor data sets, in this paper, we…
Read More

A data mining framework to analyze road accident data

Hadoop
"A data mining framework to analyze road accident data Road and traffic accidents are uncertain and unpredictable incidents and their analysis requires the knowledge of the factors affecting them. Road and traffic accidents are defined by a set of variables which are mostly of discrete nature. The major problem in the analysis of accident data is its heterogeneous nature [1]. Thus heterogeneity must be considered during analysis of the data otherwise, some relationship between the data may remain hidden. Although, researchers used segmentation of the data to reduce this heterogeneity using some measures such as expert knowledge, but there is no guarantee that this will lead to an optimal segmentation which consists of homogeneous groups of road accidents [2]. Therefore, cluster analysis can assist the segmentation of road accidents.
Read More

Big Data Challenges in Smart Grid IoT (WAMS) Deployment

Hadoop
"Big Data Challenges in Smart Grid IoT (WAMS) Deployment Internet of Things adoption across industries has proven to be beneficial in providing business value by transforming the way data is utilized in decision making and visualization. Power industry has for long struggled with traditional ways of operating and has suffered from issues like instability, blackouts,etc. The move towards smart grid has thus received lot of acceptance. This paper presents the Internet of Things deployment in grid, namely WAMS, and the challenges it present in terms of the Big Data it aggregates. Better insight into the problem is provided with the help of Indian Grid case studies.
Read More

Review Based Service Recommendation for Big Data

Hadoop
"Review Based Service Recommendation for Big Data Success of web 2.0 brings online information overload. An exponential growth of customers, services and online information has been observed in last decade. It yields big data investigation problem for service recommendation system. Traditional recommender systems often put up with scalability, lack of security and efficiency problems. Users preferences are almost ignored. So, the requirement of robust ecommendation system is enhanced now a days. In this paper, we present review based service recommendation to dynamically recommend services to the users. Keywords are extracted from passive users reviews and a rating value is given to every new keyword observed in the dataset. Sentiment analysis is performed on these rating values and top-k services recommendation list is provided to users. To make the system more…
Read More

A Profile-Based Big Data Architecture for Agricultural Context

Hadoop
"A Profile-Based Big Data Architecture for Agricultural Context Bringing Big data technologies into agriculture presents a significant challenge; at the same time, this technology contributes effectively in many countries’ economic and social development. In this work, we will study environmental data provided by precision agriculture information technologies, which represents a crucial source of data in need of being wisely managed and analyzed with appropriate methods and tools in order to extract the meaningful information. Our main purpose through this paper is to propose an effective Big data architecture based on profiling system which can assist (among others) producers, consulting companies, public bodies and research laboratories to make better decisions by providing them real time data processing, and a dynamic big data service composition method, to enhance and monitor the agricultural…
Read More

Achieving Efficient and Privacy-Preserving Cross-Domain Big Data Deduplication in Cloud

Hadoop
"Achieving Efficient and Privacy-Preserving Cross-Domain Big Data Deduplication in Cloud Secure data deduplication can significantly reduce the communication and storage overheads in cloud storage services, and has potential applications in our big data-driven society. Existing data deduplication schemes are generally designed to either resist brute-force attacks or ensure the efficiency and data availability, but not both conditions. We are also not aware of any existing scheme that achieves accountability, in the sense of reducing duplicate information disclosure (e.g., to determine whether plaintexts of two encrypted messages are identical). In this paper, we investigate a three-tier cross-domain architecture, and propose an efficient and privacy-preserving big data deduplication in cloud storage (hereafter referred to as EPCDD). EPCDD achieves both privacy-preserving and data availability, and resists brute-force attacks. In addition, we take accountability…
Read More

A Queuing Method for Adaptive Censoring in Big Data Processing

Hadoop
"A Queuing Method for Adaptive Censoring in Big Data Processing As more than 2.5 quintillion bytes of data are generated every day, the era of big data is undoubtedly upon us. Running analysis on extensive datasets is a challenge. Fortunately, a significant percentage of the data accrued can be omitted while maintaining a certain quality of statistical inference in many cases. Censoring provides us a natural option for data reduction. However, the data chosen by censoring occur nonuniformly, which may not relieve the computational resource requirement. In this paper, we propose a dynamic, queuing method to smooth out the data processing without sacrificing the convergence performance of censoring. The proposed method entails simple, closed-form updates, and has no loss in terms of accuracy comparing to the original adaptive censoring method.Simulation…
Read More

A Reliable Task Assignment Strategy for Spatial Crowdsourcing in Big Data Environment

Hadoop
"A Reliable Task Assignment Strategy for Spatial Crowdsourcing in Big Data Environment With the ubiquitous deployment of the mobile devices with increasingly better communication and computation capabilities, an emerging model called spatial crowdsourcing is proposed to solve the problem of unstructured big data by publishing location-based tasks to participating workers. However, massive spatial data generated by spatial crowdsourcing entails a critical challenge that the system has to guarantee quality control of crowdsourcing. This paper first studies a practical problem of task assignment, namely reliability aware spatial crowdsourcing (RA-SC), which takes the constrained tasks and numerous dynamic workers into consideration. Specifically, the worker confidence is introduced to reflect the completion reliability of the assigned task. Our RA-SC problem is to perform task assignments such that the reliability under budget constraints is…
Read More

An Approximate Search Framework for Big Data

Hadoop
"An Approximate Search Framework for Big Data In the age of big data, a traditional scanning search pattern is gradually becoming unfit for a satisfying user experience due to its lengthy computing process. In this paper, we propose a sampling-based approximate search framework called Hermes, to meet user’s query demand for both accurate and efficient results. A novel metric, (ε, δ)-approximation, is presented to uniformly measure accuracy and efficiency for a big data search service, which enables Hermes to work out a feasible searching job. Based on this, we employ the bootstrapping technique to further speed up the search process. Moreover, an incremental sampling strategy is investigated to process homogeneous queries; in addition, the reuse theory of historical results is also studied for the scenario of appending data. Theoretical analyses…
Read More
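
The sampling-plus-bootstrapping idea in the excerpt above can be illustrated with a minimal Python sketch. This is not the Hermes framework: the function name `bootstrap_mean_interval`, the synthetic data, and the reading of δ as a failure probability (confidence level 1 − δ) are all assumptions made for illustration, and a plain percentile bootstrap is swapped in for whatever the paper actually uses.

```python
import random
import statistics


def bootstrap_mean_interval(sample, n_resamples=1000, delta=0.05, seed=None):
    """Percentile-bootstrap interval for the mean at confidence 1 - delta."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    lo_idx = int((delta / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - delta / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical "big" data set and a small uniform sample drawn from it.
rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
sample = rng.sample(population, 500)

low, high = bootstrap_mean_interval(sample, seed=1)
print(f"estimated mean in [{low:.2f}, {high:.2f}] "
      f"from {len(sample)} of {len(population)} rows")
```

The point of the sketch is only that an answer with an error bound can be produced from a small sample instead of a full scan.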

Big Data Analytics of Geosocial Media for Planning and Real-Time Decisions

Hadoop
"Big Data Analytics of Geosocial Media for Planning and Real-Time Decisions Geosocial Network data can be served as an asset for the authorities to make real-time decisions and future planning by analyzing geosocial media posts. However, there are millions of Geosocial Network users who are producing overwhelming of data, called “Big Data” that is challenging to be analyzed and make real-time decisions. Therefore, in this paper, we proposed an efficient system for exploring Geosocial Networks while harvesting data as well as user’s location information. A system architecture is proposed that processes an abundant amount of various social networks’ data to monitor Earth events, incidents, medical diseases, user trends, and views to make future real-time decisions and facilitate future planning. The proposed system consists of five layers, i.e., data collection, data…
Read More

Big Data Driven Information Diffusion Analysis and Control in Online Social Networks

Hadoop
"Big Data Driven Information Diffusion Analysis and Control in Online Social Networks Thanks to recent advance in massive social data and increasingly mature big data mining technologies, information diffusion and its control strategies have attracted much attention, which play pivotal roles in public opinion control, virus marketing as well as other social applications. In this paper, relying on social big data, we focus on the analysis and control of information diffusion. Specifically, we commence with analyzing the topological role of the social strengths, i.e., tie strength, partial strength, value strength, and their corresponding symmetric as well as asymmetric forms. Then, we define two critical points for the cascade information diffusion model, i.e., the information coverage critical point (CCP) and the information heat critical point (HCP). Furthermore, based on the two…
Read More

Big Data Set Privacy Preserving through Sensitive Attribute-based Grouping

Hadoop
"Big Data Set Privacy Preserving through Sensitive Attribute-based Grouping There is a growing trend towards attacks on database privacy due to great value of privacy information stored in big data set. Public’s privacy are under threats as adversaries are continuously cracking their popular targets such as bank accounts. We find a fact that existing models such as K-anonymity, group records based on quasi-identifiers, which harms the data utility a lot. Motivated by this, we propose a sensitive attribute-based privacy model. Our model is the early work of grouping records based on sensitive attributes instead of quasi-identifiers which is popular in existing models. Random shuffle is used to maximize information entropy inside a group while the marginal distribution maintains the same before and after shuffling, therefore, our method maintains a better…
Read More
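
The claim that a random shuffle inside a group leaves the marginal distribution unchanged can be checked with a short Python sketch. The record layout, the `group` and `balance` field names, and the helper `shuffle_within_groups` are hypothetical; the excerpt does not say how the paper forms its groups, so the groups here are simply given.

```python
import random
from collections import Counter


def shuffle_within_groups(records, group_key, sensitive_key, seed=None):
    """Return copies of the records with sensitive values shuffled per group."""
    rng = random.Random(seed)
    groups = {}
    for idx, rec in enumerate(records):
        groups.setdefault(rec[group_key], []).append(idx)

    out = [dict(rec) for rec in records]  # leave the input untouched
    for indices in groups.values():
        values = [records[i][sensitive_key] for i in indices]
        rng.shuffle(values)  # breaks the link between a record and its value
        for i, value in zip(indices, values):
            out[i][sensitive_key] = value
    return out


# Hypothetical records: "group" is assumed to be assigned beforehand.
records = [
    {"id": 1, "group": "A", "balance": 100},
    {"id": 2, "group": "A", "balance": 250},
    {"id": 3, "group": "A", "balance": 100},
    {"id": 4, "group": "B", "balance": 900},
    {"id": 5, "group": "B", "balance": 40},
]
shuffled = shuffle_within_groups(records, "group", "balance", seed=0)

before = Counter(r["balance"] for r in records)
after = Counter(r["balance"] for r in shuffled)
print(before == after)  # the multiset of sensitive values is preserved
```

Because values only move between records of the same group, any count or histogram over the sensitive attribute is identical before and after shuffling.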

Big-Data-Driven Network Partitioning for Ultra-Dense Radio Access Networks

Hadoop
"Big-Data-Driven Network Partitioning for Ultra-Dense Radio Access Networks The increased density of base stations (BSs) may significantly add complexity to network management mechanisms and hamper them from efficiently managing the network. In this paper, we propose a big-data-driven network partitioning and optimization framework to reduce the complexity of the networking mechanisms. The proposed framework divides the entire radio access network (RAN) into multiple sub-RANs and each sub-RAN can be managed independently. Therefore, the complexity of the network management can be reduced. Quantifying the relationships among BSs is challenging in the network partitioning. We propose to extract three networking features from mobile traffic data to discover the relationships. Based on these features, we engineer the network partitioning solution in three steps. First, we design a hierarchical clustering analysis (HCA) algorithm to…
Read More

Cost Aware Cloudlet Placement for Big Data Processing at the Edge

Hadoop
"Cost Aware Cloudlet Placement for Big Data Processing at the Edge As accessing computing resources from the remote cloud for big data processing inherently incurs high end-toend (E2E) delay for mobile users, cloudlets, which are deployed at the edge of networks, can potentially mitigate this problem. Although load offloading in cloudlet networks has been proposed, placing the cloudlets to minimize the deployment cost of cloudlet providers and E2E delay of user requests has not been addressed so far. The locations and number of cloudlets and their servers have a crucial impact on both the deployment cost and E2E delay of user requests. Therefore, in this paper, we propose the Cost Aware cloudlet PlAcement in moBiLe Edge computing strategy (CAPABLE) to optimize the tradeoff between the deployment cost and E2E delay.…
Read More

CryptMDB: A Practical Encrypted MongoDB over Big Data

Hadoop
"CryptMDB: A Practical Encrypted MongoDB over Big Data In big data era, data are usually stored in databases for easy access and utilization, which are now woven into every aspect of our lives. However, traditional relational databases cannot address users’ demands for quick data access and calculating, since they cannot process data in a distributed way. To tackle this problem, non-relational databases such as MongoDB have emerged up and been applied in various Scenarios. Nevertheless, it should be noted that most MongoDB products fail to consider user’s data privacy. In this paper, we propose a practical encrypted MongoDB ( i.e., CryptMDB ). Specifically, we utilize an additive homomorphic asymmetric cryptosystem to encrypt user’s data and achieve strong privacy protection. Security analysis indicates that the CryptMDB can achieve confidentiality of user’s…
Read More

Focusing on a Probability Element: Parameter Selection of Message Importance Measure in Big Data

Hadoop
"Focusing on a Probability Element: Parameter Selection of Message Importance Measure in Big Data Message importance measure (MIM) is applicable to characterize the importance of information in the scenario of big data, similar to entropy in information theory. In fact, MIM with a variable parameter can make an effect on the characterization of distribution. Furthermore, by choosing an appropriate parameter of MIM, it is possible to emphasize the message importance of a certain probability element in a distribution. Therefore, parametric MIM can play a vital role in anomaly detection of big data by focusing on probability of an anomalous event. In this paper, we propose a parameter selection method of MIM focusing on a probability element and then present its major properties. In addition, we discuss the parameter selection with…
Read More

Holistic Perspective of Big Data in Healthcare

Hadoop
"Holistic Perspective of Big Data in Healthcare Healthcare has increased its overall value by adopting big data methods to analyze and understand its data from various sources. This article presents big data from the perspective of improving healthcare services and, also, offers a holistic view of system security and factors determining security breaches.
Read More

Novel Common Vehicle Information Model (CVIM) for Future Automotive Vehicle Big Data Marketplaces

Hadoop
"Novel Common Vehicle Information Model (CVIM) for Future Automotive Vehicle Big Data Marketplaces Even though connectivity services have been introduced in many of the most recent car models, access to vehicle data is currently limited due to its proprietary nature. The European project AutoMat has therefore developed an open Marketplace providing a single point of access for brandindependent vehicle data. Thereby, vehicle sensor data can be leveraged for the design and implementation of entirely new services even beyond traffic-related applications (such as hyperlocal traffic forecasts). This paper presents the architecture for a Vehicle Big Data Marketplace as enabler of cross-sectorial and innovative vehicle data services. Therefore, the novel Common Vehicle Information Model (CVIM) is defined as an open and harmonized data model, allowing the aggregation of brandindependent and generic data…
Read More

Online Data Deduplication for In-Memory Big-Data Analytic Systems

Hadoop
"Online Data Deduplication for In-Memory Big-Data Analytic Systems Given a set of files that show a certain degree of similarity, we consider a novel problem of performing data redundancy elimination across a set of distributed worker nodes in a shared-nothing in-memory big data analytic system. The redundancy elimination scheme is designed in a manner that is: (i) space-efficient: the total space needed to store the files is minimized and, (ii) access-isolation: data shuffling among server is also minimized. In this paper, we first show that finding an access-efficient and space optimal solution is an NP-Hard problem. Following this, we present the file partitioning algorithms that locate access-efficient solutions in an incremental manner with minimal algorithm time complexity (polynomial time). Our experimental verification on multiple data sets confirms that the proposed…
Read More
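
To make the "space-efficient" goal concrete, here is a minimal single-node Python sketch of redundancy elimination by content hashing. It is not the paper's file partitioning algorithm: the `ChunkStore` class, the fixed chunk size, and the use of SHA-256 digests are assumptions chosen to show how similar files can share stored chunks.

```python
import hashlib


class ChunkStore:
    """In-memory store that keeps each distinct chunk only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # hypothetical, tiny for demonstration
        self.chunks = {}              # digest -> chunk bytes
        self.files = {}               # file name -> ordered list of digests

    def put(self, name, data):
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # skip chunks already held
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


store = ChunkStore(chunk_size=4)
store.put("a.txt", b"AAAABBBBCCCC")
store.put("b.txt", b"AAAABBBBDDDD")  # shares two chunks with a.txt
print(store.stored_bytes(), "bytes stored for 24 bytes of input")
```

The sketch ignores the second goal in the excerpt, limiting data movement between worker nodes, which is what makes the full problem hard.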

Traffic-aware Task Placement with Guaranteed Job Completion Time for Geo-distributed Big Data

Hadoop
"Traffic-aware Task Placement with Guaranteed Job Completion Time for Geo-distributed Big Data Big data analysis is usually casted into parallel jobs running on geo-distributed data centers. Different from a single data center, geo-distributed environment imposes big challenges for big data analytics due to the limited network bandwidth between data centers located in different regions.Although research efforts have been devoted to geo-distributed big data, the results are still far from being efficient because of their suboptimal performance or high complexity. In this paper, we propose a traffic-aware task placement to minimize job completion time of big data jobs. We formulate the problem as a non-convex optimization problem and design an algorithm to solve it with proved performance gap. Finally, extensive simulations are conducted to evaluate the performance of our proposal. The…
Read More

QoS-Aware Data Replications and Placements for Query Evaluation of Big Data Analytics

Hadoop
"QoS-Aware Data Replications and Placements for Query Evaluation of Big Data Analytics Enterprise users at different geographic locations generate large-volume data and store their data at different geographic datacenters. These users may also issue ad hoc queries of big data analytics on the stored data to identify valuable information in order to help them make strategic decisions. However, it is well known that querying such large-volume big data usually is time-consuming and costly. Sometimes, users are only interested in timely approximate rather than exact query results. When this approximation is the case, applications must sacrifice either timeliness or accuracy by allowing either the latency of delivering more accurate results or the accuracy error of delivered results based on the samples of the data, rather than the entire set of data…
Read More

Twitter data analysis and visualizations using the R language on top of the Hadoop platform

Hadoop
"Twitter data analysis and visualizations using the R language on top of the Hadoop platform The main objective of the work presented within this paper was to design and implement the system for twitter data analysis and visualization in R environment using the big data processing technologies. Our focus was to leverage existing big data processing frameworks with its storage and computational capabilities to support the analytical functions implemented in R language. We decided to build the backend on top of the Apache Hadoop framework including the Hadoop HDFS as a distributed filesystem and MapReduce as a distributed computation paradigm. RHadoop packages were then used to connect the R environment to the processing layer and to design and implement the analytical functions in a distributed manner. Visualizations were implemented on…
Read More

A Micro-video Recommendation System Based on Big Data

Hadoop
With the development of the Internet and social networking services, the micro-video is becoming more popular, especially among young people. However, many users spend a lot of time finding their favorite micro-videos among the vast number of videos on the Internet, while micro-video producers do not know what kinds of viewers like their products. Therefore, this paper proposes a micro-video recommendation system. The recommendation algorithms are the core of this system. Traditional recommendation algorithms include content-based recommendation, collaborative recommendation algorithms, and so on. In the Big Data era, the challenges we face are data scale, computing performance, and other aspects. Thus, this paper improves the traditional recommendation algorithms, using the popular parallel computing framework to process the Big Data.…
Read More

Image Processing Based Fire Detection Using Raspberry Pi

Sensor
The main advantage of an Image Processing Based Fire Detection System is early warning. This system can be installed just about anywhere in commercial buildings, malls and many other public places for fire detection. The system uses a camera to detect fires, so we do not need any other sensors. The processor analyzes the camera input to detect fires. Heat signatures and fire illumination patterns are detected in images to determine whether there is a fire and take action accordingly. On detecting fire, the system goes into emergency mode and sounds an alarm. It also displays the system status on the LCD display.
Read More

Linux Based Speaking Medication Reminder Project

Sensor
Medication mix-ups are extremely dangerous. The Linux based Speaking Medication Reminder can help to prevent these life-threatening mistakes. It first allows users to enter reminder inputs. The system takes input through a keyboard to accept various reminders with date, time and dosage. It then reminds patients to take the right medication at the right time. The system stores users' medication dates and times on a Raspberry Pi, along with the dosage for each reminder. At the set time, the system retrieves the details, converts the text to speech and speaks out the medication reminder. This allows for a fully automated medication reminder system for patients.
Read More
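
The reminder flow described above (store date, time and dosage, then announce at the set time) can be sketched in a few lines of Python. The record fields and the `due_reminders` helper are hypothetical, and printing stands in for the text-to-speech step, whose implementation the description does not specify.

```python
from datetime import datetime


def due_reminders(reminders, now):
    """Return the spoken text for every reminder whose time has arrived."""
    messages = []
    for reminder in reminders:
        if not reminder["announced"] and reminder["when"] <= now:
            messages.append(
                f"Time to take {reminder['dosage']} of {reminder['medicine']}"
            )
            reminder["announced"] = True  # do not repeat the same reminder
    return messages


# Hypothetical entries as a user might type them in.
reminders = [
    {"medicine": "medicine A", "dosage": "1 tablet",
     "when": datetime(2024, 1, 1, 8, 0), "announced": False},
    {"medicine": "medicine B", "dosage": "2 tablets",
     "when": datetime(2024, 1, 1, 20, 0), "announced": False},
]

morning = due_reminders(reminders, now=datetime(2024, 1, 1, 8, 0))
for text in morning:
    print(text)  # a real build would pass this text to a speech engine
```

On a device, `due_reminders` would be called periodically with `datetime.now()`.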

Automated Door Opener With Lighting Control Using Raspberry Pi

Sensor
We here propose a system that uses a Raspberry Pi along with passive IR (PIR) sensors, motors and lights to demonstrate an automated door opening and lighting control system. The proposed system opens the door automatically using human detection and controls lighting using the Raspberry Pi. The system detects human presence and then opens the door automatically depending on the human sensing input detected. The system also keeps track of lighting conditions in the room and switches lights as needed to reach the desired illumination. It also tries to detect the number of people present in the room and operates the lighting accordingly. All sensor data, including PIR and light illumination data, is constantly transferred to the Raspberry Pi processor for processing which…
Read More

Camera Based Surveillance System Using Raspberry Pi

Sensor
Camera Based Surveillance System Using Raspberry Pi is mainly beneficial for detecting crime. It monitors scenarios and activities, and is helpful for gathering evidence and detecting thefts instantly. The system is built to monitor homes and offices and detect theft as soon as it takes place. The system uses a Raspberry Pi with a camera based circuit. The camera input is constantly fed to the Raspberry Pi processor, which processes it for any motion. If any motion is detected, the system goes into alert mode. The system then sounds an alarm and captures images of the motion. These images are saved for later viewing. Thus the system is an efficient camera based security system.…
Read More
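
The description does not say how motion is detected. One common approach, assumed here purely for illustration, is frame differencing: compare consecutive grayscale frames and raise an alert when enough pixels change. The function name and both thresholds are hypothetical, and frames are plain nested lists rather than real camera images.

```python
def motion_detected(previous, current, pixel_delta=25, min_changed=3):
    """Report motion when enough pixels differ between two grayscale frames."""
    changed = sum(
        1
        for row_prev, row_cur in zip(previous, current)
        for a, b in zip(row_prev, row_cur)
        if abs(a - b) > pixel_delta  # ignore small brightness noise
    )
    return changed >= min_changed


# Tiny 2x3 frames with brightness values 0-255.
still = [[10, 10, 10], [10, 10, 10]]
moved = [[10, 200, 200], [10, 200, 10]]

if motion_detected(still, moved):
    print("ALERT: motion detected, capturing image")
```

The two thresholds trade sensitivity against false alarms from lighting flicker.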

Drunk Driving Detection With Car Ignition Locking

Sensor
Drunk driving is a major cause of road deaths, so the Drunk Driving Detection With Car Ignition Locking Using Raspberry Pi project aims to change that with an automated, transparent, noninvasive alcohol safety check in vehicles. The system uses a Raspberry Pi with an alcohol sensor, a DC motor and an LCD display circuit to achieve this purpose. The DC motor stands in for the vehicle engine. The system constantly monitors the alcohol sensor output for drunk driver detection. If the driver is drunk, the processor instantly cuts the ignition by stopping the motor. If the alcohol sensor is not giving high alcohol intensity signals, the system lets the engine run. The Raspberry Pi processor constantly processes the alcohol sensor data to check for drunk driving and…
Read More
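
The decision rule above reduces to a threshold check on each sensor reading. A minimal Python sketch follows, assuming a raw numeric reading and a hypothetical threshold of 400; real sensors need calibration, and the actual wiring to the motor is outside this sketch.

```python
ALCOHOL_THRESHOLD = 400  # hypothetical raw sensor value, needs calibration


def motor_command(reading, threshold=ALCOHOL_THRESHOLD):
    """Stop the motor (the stand-in engine) on a high alcohol reading."""
    return "STOP" if reading >= threshold else "RUN"


# Simulated stream of sensor readings.
readings = [120, 180, 450, 510, 150]
commands = [motor_command(r) for r in readings]
print(commands)
```

In this sketch the engine may run again as soon as the reading drops; whether the real project latches the lock is not stated in the description.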

Ultrasonic Blind Walking Stick

Sensor
The blind stick is an innovative stick designed to improve navigation for visually impaired people. We here propose an advanced blind stick that allows visually challenged people to navigate with ease using advanced technology. The blind stick is integrated with an ultrasonic sensor along with light and water sensing. Our proposed project first uses ultrasonic sensors to detect obstacles ahead using ultrasonic waves. On sensing obstacles, the sensor passes this data to the microcontroller. The microcontroller then processes this data and calculates whether the obstacle is close enough. If the obstacle is not that close, the circuit does nothing. If the obstacle is close, the microcontroller sends a signal to sound a buzzer. It also detects and sounds a different buzzer if it detects water and alerts the…
Read More
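
The "is the obstacle close enough" calculation can be sketched in Python. An ultrasonic sensor measures the round-trip time of an echo, so distance is the speed of sound times the echo time, divided by two. The 1.0 m alert threshold and the function names are assumptions; the description gives no specific values.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value in air at about 20 °C


def echo_to_distance_m(echo_seconds):
    """Convert a round-trip echo time into a one-way distance in metres."""
    return SPEED_OF_SOUND_M_PER_S * echo_seconds / 2


def should_buzz(echo_seconds, threshold_m=1.0):
    """Sound the buzzer only when the obstacle is within the threshold."""
    return echo_to_distance_m(echo_seconds) <= threshold_m


for echo in (0.010, 0.004):
    distance = echo_to_distance_m(echo)
    state = "BUZZ" if should_buzz(echo) else "quiet"
    print(f"echo {echo * 1000:.0f} ms -> {distance:.2f} m -> {state}")
```

On a microcontroller the same arithmetic would run on the measured pulse width from the sensor's echo pin.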

Rain Sensing Automatic Car Wiper

Sensor
Many car wipers are manual systems that rely on manual switching. So here we propose an automatic wiper system that switches on automatically on detecting rain and stops when the rain stops. Our project automates the wiper system, removing the need for manual intervention. For this purpose we use a rain sensor along with a microcontroller and a driver IC to drive the wiper motor. The rain sensor detects rain, and its signal is then processed by the microcontroller to take the desired action. The rain sensor works on the principle of using water to complete its circuit, so when rain falls on it, its circuit is completed and it sends out a signal to the microcontroller. The microcontroller now processes…
Read More

Mobile Charging On Coin Insertion

Sensor
This is a smart coin based mobile charging system that charges your mobile phone for a particular amount of time on inserting a coin. The system is intended for shop owners and public places such as railway stations to provide a mobile charging facility. The system consists of a coin recognition module that recognizes valid coins and then signals the microcontroller for further action. If a valid coin is found, the microcontroller starts the mobile charging mechanism, providing a 5V supply through a power supply section to the mobile phone. The system also needs to monitor the amount of charging to be provided, so the microcontroller starts a countdown timer to display the charging time for that mobile phone. Now if…
Read More
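
The coin-then-countdown behaviour can be sketched as a small Python generator. The charging time per coin is an assumption (the excerpt does not state it), the function name is hypothetical, and the sketch simulates the timer values without actually waiting or switching a supply.

```python
def charging_countdown(coin_is_valid, seconds_per_coin=3):
    """Yield the remaining charging time, one value per timer tick."""
    if not coin_is_valid:
        return  # rejected coin: charging never starts
    for remaining in range(seconds_per_coin, -1, -1):
        yield remaining


for remaining in charging_countdown(coin_is_valid=True):
    supply = "5V ON" if remaining > 0 else "5V OFF"
    print(f"time left: {remaining}s ({supply})")
```

A real build would wait one second per tick and drive the supply from the same loop.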

FingerPrint Voting System

Sensor
One of the major factors to be taken care of in a voting process is authentication and authorization of voters. Many conditions need to be checked to ensure these factors. Major conditions include: (1) check the authenticity of the voter, (2) authorize legitimate voters to vote, and (3) avoid double vote casting by any individual. Checking all these conditions manually is a very complicated and exhausting task with many chances of human error. To avoid this, we propose a fingerprint based voting system project. We use a fingerprint module interfaced with a microcontroller and an LCD screen in this system. The fingerprint module is used to sense fingerprints and pass them to the microcontroller for further processing. The system stores a list of eligible voters; the voting system tallies the…
Read More
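
The three conditions listed above map onto two set lookups and a tally. The Python sketch below assumes the fingerprint module has already matched a scan to a voter ID; the `VotingMachine` class, the IDs and the candidate names are all hypothetical.

```python
class VotingMachine:
    """Accept one vote per eligible voter and keep a running tally."""

    def __init__(self, eligible_ids, candidates):
        self.eligible = set(eligible_ids)
        self.voted = set()
        self.tally = {candidate: 0 for candidate in candidates}

    def cast(self, voter_id, candidate):
        if voter_id not in self.eligible:
            return "rejected: not an eligible voter"
        if voter_id in self.voted:
            return "rejected: already voted"
        if candidate not in self.tally:
            return "rejected: unknown candidate"
        self.voted.add(voter_id)
        self.tally[candidate] += 1
        return "accepted"


machine = VotingMachine(eligible_ids=[101, 102, 103], candidates=["X", "Y"])
print(machine.cast(101, "X"))  # eligible, first vote
print(machine.cast(101, "Y"))  # same fingerprint again
print(machine.cast(999, "X"))  # fingerprint not on the list
print(machine.tally)
```

Checking eligibility before the double-vote test keeps unknown fingerprints out of the `voted` set.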

Fingerprint Based Exam Hall Authentication

Sensor
Here we propose a fingerprint based examination hall authentication system. The system is designed to pass only users verified by their fingerprint scan and block non-verified users. Our system consists of a fingerprint scanner connected to a microcontroller circuit. In registration mode the system allows up to 20 users to register and saves their identity with respective ID numbers in the system memory. After registration, a person first needs to scan a finger on the scanner. The microcontroller then checks the validity of the person’s fingerprint. If the fingerprint is authorized, the microcontroller sends a signal to a motor driver. The motor driver then operates a motor to open a gate. This ensures only authorized users are allowed to enter the examination section and unauthorized…
Read More
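
The registration and verification modes can be sketched in Python. The 20-user limit comes from the description; the `ExamHallGate` class, the integer fingerprint IDs and the student names are assumptions, and the gate motor is represented by a returned boolean.

```python
MAX_USERS = 20  # registration limit stated in the description


class ExamHallGate:
    """Register fingerprints, then open the gate only for registered ones."""

    def __init__(self):
        self.registered = {}  # fingerprint ID -> stored identity

    def register(self, fingerprint_id, name):
        if fingerprint_id in self.registered:
            return False  # already enrolled
        if len(self.registered) >= MAX_USERS:
            return False  # system memory is full
        self.registered[fingerprint_id] = name
        return True

    def gate_opens(self, fingerprint_id):
        # True would translate to a signal for the motor driver.
        return fingerprint_id in self.registered


gate = ExamHallGate()
results = [gate.register(i, f"student-{i}") for i in range(1, 22)]
print(results.count(True), "registered;", "ID 21 accepted:", results[-1])
print("ID 5 enters:", gate.gate_opens(5), "| ID 21 enters:", gate.gate_opens(21))
```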