FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce

Artificial Intelligence & ML, Data mining, Hadoop
Data mining is the process of discovering patterns in huge amounts of data. There are many data mining techniques, such as clustering, classification, and association rule mining. The most popular is association rule mining, which is divided into two parts: (i) generating frequent itemsets and (ii) generating association rules from those itemsets. Frequent itemset mining (FIM) is the core problem in association rule mining. Sequential FIM algorithms suffer from performance deterioration when they operate on huge amounts of data on a single machine. To address this problem, parallel FIM algorithms were proposed. Two types of algorithms can be used for mining frequent itemsets: those that generate candidate itemsets and those that avoid candidate-itemset generation.…
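As a rough illustration of the candidate-generation approach, the sketch below counts candidate itemsets with toy in-process map and reduce phases. It runs locally rather than on Hadoop, and all names (map_phase, reduce_phase, MIN_SUPPORT) are illustrative, not FiDoop's API.

```python
# Toy MapReduce-style frequent-itemset counting (candidate-generation approach).
# Illustrative only: runs in-process, not on Hadoop; names are hypothetical.
from collections import Counter
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
MIN_SUPPORT = 2  # absolute support threshold

def map_phase(txn, k, frequent_prev):
    """Emit (candidate k-itemset, 1) pairs, pruning by frequent (k-1)-subsets."""
    for cand in combinations(sorted(txn), k):
        if k == 1 or all(frozenset(sub) in frequent_prev
                         for sub in combinations(cand, k - 1)):
            yield frozenset(cand), 1

def reduce_phase(pairs):
    """Sum counts and keep itemsets meeting the support threshold."""
    counts = Counter()
    for itemset, one in pairs:
        counts[itemset] += one
    return {s for s, c in counts.items() if c >= MIN_SUPPORT}

frequent, prev = set(), set()
for k in (1, 2):
    pairs = (p for txn in transactions for p in map_phase(txn, k, prev))
    prev = reduce_phase(pairs)
    frequent |= prev

print(sorted(tuple(sorted(s)) for s in frequent))
```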

Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework

Hadoop
"Dache: A Data Aware Caching for Big-Data Applications Using the MapReduce Framework The buzz-word big-data refers to the large-scale distributed data processing applications that operate on exceptionally large amounts of data. Google’s MapReduce and Apache’s Hadoop, its open-source implementation, are the defacto software systems for big-data applications. An observation of the MapReduce framework is that the framework generates a large amount of intermediate data. Such abundant information is thrown away after the tasks finish, because MapReduce is unable to utilize them. In this paper, we propose Dache, a data-aware cache framework for big-data applications. In Dache, tasks submit their intermediate results to the cache manager. A task queries the cache manager before executing the actual computing work. A novel cache description scheme and a cache request and reply protocol are…

ClubCF: A Clustering-based Collaborative Filtering Approach for Big Data Application

Hadoop
"ClubCF: A Clustering-based Collaborative Filtering Approach for Big Data Application Spurred by service computing and cloud computing, an increasing number of services are emerging on the Internet. As a result, service-relevant data become too big to be effectively processed by traditional approaches. In view of this challenge, a Clustering-based Collaborative Filtering approach (ClubCF) is proposed in this paper, which aims at recruiting similar services in the same clusters to recommend services collaboratively. Technically, this approach is enacted around two stages. In the first stage, the available services are divided into small-scale clusters, in logic, for further processing. At the second stage, a collaborative filtering algorithm is imposed on one of the clusters. Since the number of the services in a cluster is much less than the total number of the…

Secure Sensitive Data Sharing on a Big Data Platform

Hadoop
"Secure Sensitive Data Sharing on a Big Data Platform Users store vast amounts of sensitive data on a big data platform. Sharing sensitive data will help enterprises reduce the cost of providing users with personalized services and provide value-added data services. However, secure data sharing is problematic. This paper proposes a framework for secure sensitive data sharing on a big data platform, including secure data delivery, storage, usage, and destruction on a semi-trusted big data sharing platform. We present a proxy re-encryption algorithm based on heterogeneous ciphertext transformation and a user process protection method based on a virtual machine monitor, which provides support for the realization of system functions. The framework protects the security of users’ sensitive data effectively and shares these data safely. At the same time, data owners…
Read More

Load Balancing for Privacy-Preserving Access to Big Data in Cloud

Hadoop
"Load Balancing for Privacy-Preserving Access to Big Data in Cloud In the era of big data, many users and companies start to move their data to cloud storage to simplify data management and reduce data maintenance cost. However, security and privacy issues become major concerns because third-party cloud service providers are not always trusty. Although data contents can be protected by encryption, the access patterns that contain important information are still exposed to clouds or malicious attackers. In this paper, we apply the ORAM algorithm to enable privacy-preserving access to big data that are deployed in distributed file systems built upon hundreds or thousands of servers in a single or multiple geo-distribu ted cloud sites. Since the ORAM algorithm would lead to serious access load unbalance among storage servers, we…

Enabling Efficient Access Control with Dynamic Policy Updating for Big Data in the Cloud

Hadoop
"Enabling Efficient Access Control with Dynamic Policy Updating for Big Data in the Cloud Due to the high volume and velocity of big data, it is an effective option to store big data in the cloud, because the cloud has capabilities of storing big data and processing high volume of user access requests. Attribute-Based Encryption (ABE) is a promising technique to ensure the end-to-end security of big data in the cloud. However, the policy updating has always been a challenging issue when ABE is used to construct access control schemes. A trivial implementation is to let data owners retrieve the data and re-encrypt it under the new access policy, and then send it back to the cloud. This method incurs a high communication overhead and heavy computation burden on data…
Read More

MRPrePost-A parallel algorithm adapted for mining big data

Hadoop
"MRPrePost-A parallel algorithm adapted for mining big data With the explosive growth in data, using data mining techniques to mine association rules, and then to find valuable information hidden in big data has become increasingly important. Various existing data mmmg techniques often through mining frequent itemsets to derive association rules and access to relevant knowledge, but with the rapid arrival of the era of big data, Traditional data mining algorithms have been unable to meet large data's analysis needs. In view of this, this paper proposes an adaptation to the big data mining parallel algorithms-MRPrePost. MRPrePost is a parallel algorithm based on Hadoop platform, which improves PrePost by way of adding a prefix pattern, and on this basis into the parallel design ideas, making MRPrePost algorithm can adapt to mining…
Read More

Big Data Challenges in Smart Grid IoT (WAMS) Deployment

Hadoop
"Big Data Challenges in Smart Grid IoT (WAMS) Deployment Internet of Things adoption across industries has proven to be beneficial in providing business value by transforming the way data is utilized in decision making and visualization. Power industry has for long struggled with traditional ways of operating and has suffered from issues like instability, blackouts,etc. The move towards smart grid has thus received lot of acceptance. This paper presents the Internet of Things deployment in grid, namely WAMS, and the challenges it present in terms of the Big Data it aggregates. Better insight into the problem is provided with the help of Indian Grid case studies.
Read More

Privacy Preserving Data Analytics for Smart Homes

Hadoop
"Privacy Preserving Data Analytics for Smart Homes A framework for maintaining security & preserving privacy for analysis of sensor data from smart homes, without compromising on data utility is presented. Storing the personally identifiable data as hashed values withholds identifiable information from any computing nodes. However the very nature of smart home data analytics is establishing preventive care. Data processing results should be identifiable to certain users responsible for direct care. Through a separate encrypted identifier dictionary with hashed and actual values of all unique sets of identifiers, we suggest re-identification of any data processing results. However the level of re-identification needs to be controlled, depending on the type of user accessing the results. Generalization and suppression on identifiers from the identifier dictionary before re-introduction could achieve different levels of…

A data mining framework to analyze road accident data

Hadoop
"A data mining framework to analyze road accident data Road and traffic accidents are uncertain and unpredictable incidents and their analysis requires the knowledge of the factors affecting them. Road and traffic accidents are defined by a set of variables which are mostly of discrete nature. The major problem in the analysis of accident data is its heterogeneous nature [1]. Thus heterogeneity must be considered during analysis of the data otherwise, some relationship between the data may remain hidden. Although, researchers used segmentation of the data to reduce this heterogeneity using some measures such as expert knowledge, but there is no guarantee that this will lead to an optimal segmentation which consists of homogeneous groups of road accidents [2]. Therefore, cluster analysis can assist the segmentation of road accidents.

Authorized Public Auditing of Dynamic Big Data Storage on Cloud with Efficient Verifiable Fine-grained Updates

Hadoop
"Authorized Public Auditing of Dynamic Big Data Storage on Cloud with Efficient Verifiable Fine-grained Updates Cloud computing opens a new era in IT as it can provide various elastic and scalable IT services in a pay-as-you-go fashion, where its users can reduce the huge capital investments in their own IT infrastructure. In this philosophy, users of cloud storage services no longer physically maintain direct control over their data, which makes data security one of the major concerns of using cloud. Existing research work already allows data integrity to be verified without possession of the actual data file. When the verification is done by a trusted third party, this verification process is also called data auditing, and this third party is called an auditor. However, such schemes in existence suffer from…
Read More

A Time Efficient Approach for Detecting Errors in Big Sensor Data on Cloud

Hadoop
"A data mining framework to analyze road accident data Big sensor data is prevalent in both industry and scientific research applications where the data is generated with high volume and velocity it is difficult to process using on-hand database management tools or traditional data processing applications. Cloud computing provides a promising platform to support the addressing of this challenge as it provides a flexible stack of massive computing, storage, and software services in a scalable manner at low cost. Some techniques have been developed in recent years for processing sensor data on cloud, such as sensor-cloud. However, these techniques do not provide efficient support on fast detection and locating of errors in big sensor data sets. For fast data error detection in big sensor data sets, in this paper, we…

KASR: A Keyword-Aware Service Recommendation Method on MapReduce for Big Data

Hadoop
"KASR: A Keyword-Aware Service Recommendation Method on MapReduce for Big Data Applications Service recommender systems have been shown as valuable tools for providing appropriate recommendations to users. In the last decade, the amount of customers, services and online information has grown rapidly, yielding the big data analysis problem for service recommender systems. Consequently, traditional service recommender systems often suffer from scalability and inefficien-cy problems when processing or analysing such large-scale data. Moreover, most of existing service recommender systems present the same ratings and rankings of services to different users without considering diverse users' preferences, and therefore fails to meet users' personalized requirements. In this paper, we propose a Keyword-Aware Service Recommendation method, named KASR, to address the above challenges. It aims at presenting a personalized service recommendation list and recommending…

Big data, big knowledge: big data for personalised healthcare

Hadoop
"Big data, big knowledge: big data for personalised healthcare The idea that the purely phenomenological knowledge that we can extract by analysing large amounts of data can be useful in healthcare seems to contradict the desire of VPH researchers to build detailed mechanistic models for individual patients. But in practice no model is ever entirely phenomenological or entirely mechanistic. We propose in this position paper that big data analytics can be successfully combined with VPH technologies to produce robust and effective in silico medicine solutions. In order to do this, big data technologies must be further developed to cope with some specific requirements that emerge from this application. Such requirements are: working with sensitive data; analytics of complex and heterogeneous data spaces, including non-textual information; distributed data management under security…
Read More

Cost Minimization for Big Data Processing in Geo-Distributed Data Centers

Hadoop
"Cost Minimization for Big Data Processing in Geo-Distributed Data Centers The explosive growth of demands on big data processing imposes a heavy burden on computation, storage, and communication in data centers, which hence incurs considerable operational expenditure to data center providers. Therefore, cost minimization has become an emergent issue for the upcoming big data era. Different from conventional cloud services, one of the main features of big data services is the tight coupling between data and computation as computation tasks can be conducted only when the corresponding data is available. As a result, three factors, i.e., task assignment, data placement and data movement, deeply influence the operational expenditure of data centers. In this paper, we are motivated to study the cost minimization problem via a joint optimization of these three…
Read More

Deduplication on Encrypted Big Data in Cloud

Hadoop
"Deduplication on Encrypted Big Data in Cloud Cloud computing offers a new way of service provision by re-arranging various resources over the Internet. The most important and popular cloud service is data storage. In order to preserve the privacy of data holders, data are often stored in cloud in an encrypted form. However, encrypted data introduce new challenges for cloud data deduplication, which becomes crucial for big data storage and processing in cloud. Traditional deduplication schemes cannot work on encrypted data. Existing solutions of encrypted data deduplication suffer from security weakness. They cannot flexibly support data access control and revocation. Therefore, few of them can be readily deployed in practice. In this paper, we propose a scheme to deduplicate encrypted data stored in cloud based on ownership challenge and proxy…

Processing Geo-Dispersed Big Data in an Advanced MapReduce Framework

Hadoop
"Processing Geo-Dispersed Big Data in an Advanced MapReduce Framework Big data takes many forms, including messages in social networks, data collected from various sensors, captured videos, and so on. Big data applications aim to collect and analyze large amounts of data, and efficiently extract valuable information from the data. A recent report shows that the amount of data on the Internet is about 500 billion GB. With the fast increase of mobile devices that can perform sensing and access the Internet, large amounts of data are generated daily. In general, big data has three features: large volume, high velocity and large variety [1]. The International Data Corporation (IDC) predicted that the total amount of data generated in 2020 globally will be about 35 ZB. Facebook needs to process about 1.3…
Read More

Recent Advances in Autonomic Provisioning of Big Data Applications on Clouds

Hadoop
"Recent Advances in Autonomic Provisioning of Big Data Applications on Clouds CLOUD computing [1] assembles large networks of virtualized ICT services such as hardware resources (such as CPU, storage, and network), software resources (such as databases, application servers, and web servers) and applications.In industry these services are referred to as infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). Mainstream ICT powerhouses such as Amazon, HP, and IBM are heavily investing in the provision and support of public cloud infrastructure. Cloud computing is rapidly becoming a popular infrastructure of choice among all types of organisations. Despite some initial security concerns and technical issues, an increasing number of organisations have moved their applications and services in to “The Cloud”. These applications range from generic…
Read More

Privacy Preserving Data Analysis in Mental Health Research

Hadoop
"Privacy Preserving Data Analysis in Mental Health Research The digitalization of mental health records and psychotherapy notes has made individual mental health data more readily accessible to a wide range of users including patients, psychiatrists, researchers, statisticians, and data scientists. However, increased accessibility of highly sensitive mental records threatens the privacy and confidentiality of psychiatric patients. The objective of this study is to examine privacy concerns in mental health research and develop a privacy preserving data analysis approach to address these concerns. In this paper, we demonstrate the key inadequacies of the existing privacy protection approaches applicable to use of mental health records and psychotherapy notes in recordsbased research. We then develop a privacy-preserving data analysis approach that enables researchers to protect the privacy of people with mental illness once…
Read More

BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value Store

Hadoop
"BFC: High-Performance Distributed Big-File Cloud Storage Based On Key-Value Store Nowadays, cloud-based storage services are rapidly growing and becoming an emerging trend in data storage field. There are many problems when designing an efficient storage engine for cloud-based systems with some requirements such as big-file processing, lightweight meta-data, low latency, parallel I/O, deduplication, distributed, high scalability. Key-value stores played an important role and showed many advantages when solving those problems. This paper presents about Big File Cloud (BFC) with its algorithms and architecture to handle most of problems in a big-file cloud storage system based on keyvalue store. It is done by proposing low-complicated, fixed-size meta-data design, which supports fast and highly-concurrent, distributed file I/O, several algorithms for resumable upload, download and simple data deduplication method for static data. This…

Performance Analysis of Scheduling Algorithms for Dynamic Workflow Applications

Hadoop
"Performance Analysis of Scheduling Algorithms for Dynamic Workflow Applications In recent years, Big Data has changed how we do computing. Even though we have large scale infrastructure such as Cloud computing and several platforms such as Hadoop available to process the workloads, with Big Data there is a high level of uncertainty that has been introduced in how an application processes the data. Data in general comes in different formats, at different speed and at different volume. Processing consists of not just one application but several applications combined to form a workflow to achieve a certain goal. With data variation and at different speed, applications execution and resource needs will also vary at runtime. These are called dynamic workflows. One can say that we can just throw more and more…
Read More

PaWI: Parallel Weighted Itemset Mining by means of MapReduce

Hadoop
"PaWI: ParallelWeighted Itemset Mining by means of MapReduce Frequent itemset mining is an exploratory data mining technique that has fruitfully been exploited to extract recurrent co-occurrences between data items. Since in many application contexts items are enriched with weights denoting their relative importance in the analyzed data, pushing item weights into the itemset mining process, i.e., mining weighted itemsets rather than traditional itemsets, is an appealing research direction. Although many efficient in-memory weighted itemset mining algorithms are available in literature, there is a lack of parallel and distributed solutions which are able to scale towards Big Weighted Data. This paper presents a scalable frequent weighted itemset mining algorithm based on the MapReduce paradigm. To demonstrate its actionability and scalability, the proposed algorithm was tested on a real Big dataset collecting…
Read More

Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing

Hadoop
"Building a Big Data Analytics Service Framework for Mobile Advertising and Marketing The unprecedented growth in mobile device adoption and the rapid advancement of mobile technologies & wireless networks have created new opportunities in mobile marketing and adverting. The opportunities for Mobile Marketers and Advertisers include real-time customer engagement, improve customer experience, build brand loyalty, increase revenues, and drive customer satisfaction. The challenges, however, for the Marketers and Advertisers include how to analyze troves of data that mobile devices emit and how to derive customer engagement insights from the mobile data. This research paper addresses the challenge by developing Big Data Mobile Marketing analytics and advertising recommendation framework. The proposed framework supports both offline and online advertising operations in which the selected analytics techniques are used to provide advertising recommendations…
Read More

CryptMDB: A Practical Encrypted MongoDB over Big Data

Hadoop
"CryptMDB: A Practical Encrypted MongoDB over Big Data In big data era, data are usually stored in databases for easy access and utilization, which are now woven into every aspect of our lives. However, traditional relational databases cannot address users’ demands for quick data access and calculating, since they cannot process data in a distributed way. To tackle this problem, non-relational databases such as MongoDB have emerged up and been applied in various Scenarios. Nevertheless, it should be noted that most MongoDB products fail to consider user’s data privacy. In this paper, we propose a practical encrypted MongoDB ( i.e., CryptMDB ). Specifically, we utilize an additive homomorphic asymmetric cryptosystem to encrypt user’s data and achieve strong privacy protection. Security analysis indicates that the CryptMDB can achieve confidentiality of user’s…

Cost Aware Cloudlet Placement for Big Data Processing at the Edge

Hadoop
"Cost Aware Cloudlet Placement for Big Data Processing at the Edge As accessing computing resources from the remote cloud for big data processing inherently incurs high end-toend (E2E) delay for mobile users, cloudlets, which are deployed at the edge of networks, can potentially mitigate this problem. Although load offloading in cloudlet networks has been proposed, placing the cloudlets to minimize the deployment cost of cloudlet providers and E2E delay of user requests has not been addressed so far. The locations and number of cloudlets and their servers have a crucial impact on both the deployment cost and E2E delay of user requests. Therefore, in this paper, we propose the Cost Aware cloudlet PlAcement in moBiLe Edge computing strategy (CAPABLE) to optimize the tradeoff between the deployment cost and E2E delay.…
Read More

Big-Data-Driven Network Partitioning for Ultra-Dense Radio Access Networks

Hadoop
"Big-Data-Driven Network Partitioning for Ultra-Dense Radio Access Networks The increased density of base stations (BSs) may significantly add complexity to network management mechanisms and hamper them from efficiently managing the network. In this paper, we propose a big-data-driven network partitioning and optimization framework to reduce the complexity of the networking mechanisms. The proposed framework divides the entire radio access network (RAN) into multiple sub-RANs and each sub-RAN can be managed independently. Therefore, the complexity of the network management can be reduced. Quantifying the relationships among BSs is challenging in the network partitioning. We propose to extract three networking features from mobile traffic data to discover the relationships. Based on these features, we engineer the network partitioning solution in three steps. First, we design a hierarchical clustering analysis (HCA) algorithm to…
Read More

Big Data Set Privacy Preserving through Sensitive Attribute-based Grouping

Hadoop
"Big Data Set Privacy Preserving through Sensitive Attribute-based Grouping There is a growing trend towards attacks on database privacy due to great value of privacy information stored in big data set. Public’s privacy are under threats as adversaries are continuously cracking their popular targets such as bank accounts. We find a fact that existing models such as K-anonymity, group records based on quasi-identifiers, which harms the data utility a lot. Motivated by this, we propose a sensitive attribute-based privacy model. Our model is the early work of grouping records based on sensitive attributes instead of quasi-identifiers which is popular in existing models. Random shuffle is used to maximize information entropy inside a group while the marginal distribution maintains the same before and after shuffling, therefore, our method maintains a better…

Big Data Driven Information Diffusion Analysis and Control in Online Social Networks

Hadoop
"Big Data Driven Information Diffusion Analysis and Control in Online Social Networks Thanks to recent advance in massive social data and increasingly mature big data mining technologies, information diffusion and its control strategies have attracted much attention, which play pivotal roles in public opinion control, virus marketing as well as other social applications. In this paper, relying on social big data, we focus on the analysis and control of information diffusion. Specifically, we commence with analyzing the topological role of the social strengths, i.e., tie strength, partial strength, value strength, and their corresponding symmetric as well as asymmetric forms. Then, we define two critical points for the cascade information diffusion model, i.e., the information coverage critical point (CCP) and the information heat critical point (HCP). Furthermore, based on the two…
Read More

Big Data Analytics of Geosocial Media for Planning and Real-Time Decisions

Hadoop
"Big Data Analytics of Geosocial Media for Planning and Real-Time Decisions Geosocial Network data can be served as an asset for the authorities to make real-time decisions and future planning by analyzing geosocial media posts. However, there are millions of Geosocial Network users who are producing overwhelming of data, called “Big Data” that is challenging to be analyzed and make real-time decisions. Therefore, in this paper, we proposed an efficient system for exploring Geosocial Networks while harvesting data as well as user’s location information. A system architecture is proposed that processes an abundant amount of various social networks’ data to monitor Earth events, incidents, medical diseases, user trends, and views to make future real-time decisions and facilitate future planning. The proposed system consists of five layers, i.e., data collection, data…
Read More

An Approximate Search Framework for Big Data

Hadoop
"An Approximate Search Framework for Big Data In the age of big data, a traditional scanning search pattern is gradually becoming unfit for a satisfying user experience due to its lengthy computing process. In this paper, we propose a sampling-based approximate search framework called Hermes, to meet user’s query demand for both accurate and efficient results. A novel metric, (ε, δ)-approximation, is presented to uniformly measure accuracy and efficiency for a big data search service, which enables Hermes to work out a feasible searching job. Based on this, we employ the bootstrapping technique to further speed up the search process. Moreover, an incremental sampling strategy is investigated to process homogeneous queries; in addition, the reuse theory of historical results is also studied for the scenario of appending data. Theoretical analyses…

A Reliable Task Assignment Strategy for Spatial Crowdsourcing in Big Data Environment

Hadoop
"A Reliable Task Assignment Strategy for Spatial Crowdsourcing in Big Data Environment With the ubiquitous deployment of the mobile devices with increasingly better communication and computation capabilities, an emerging model called spatial crowdsourcing is proposed to solve the problem of unstructured big data by publishing location-based tasks to participating workers. However, massive spatial data generated by spatial crowdsourcing entails a critical challenge that the system has to guarantee quality control of crowdsourcing. This paper first studies a practical problem of task assignment, namely reliability aware spatial crowdsourcing (RA-SC), which takes the constrained tasks and numerous dynamic workers into consideration. Specifically, the worker confidence is introduced to reflect the completion reliability of the assigned task. Our RA-SC problem is to perform task assignments such that the reliability under budget constraints is…
Read More

A Micro-video Recommendation System Based on Big Data

Hadoop
With the development of the Internet and social networking services, micro-videos are becoming more popular, especially among young people. However, many users spend a lot of time finding their favorite micro-videos among the huge number of videos on the Internet, and micro-video producers do not know what kinds of viewers like their products. Therefore, this paper proposes a micro-video recommendation system, with recommendation algorithms at its core. Traditional recommendation algorithms include content-based recommendation, collaborative recommendation algorithms, and so on. In the Big Data era, the challenges we face are data scale, computing performance, and other aspects. Thus, this paper improves the traditional recommendation algorithms, using a popular parallel computing framework to process the Big Data.…

A Queuing Method for Adaptive Censoring in Big Data Processing

Hadoop
"A Queuing Method for Adaptive Censoring in Big Data Processing As more than 2.5 quintillion bytes of data are generated every day, the era of big data is undoubtedly upon us. Running analysis on extensive datasets is a challenge. Fortunately, a significant percentage of the data accrued can be omitted while maintaining a certain quality of statistical inference in many cases. Censoring provides us a natural option for data reduction. However, the data chosen by censoring occur nonuniformly, which may not relieve the computational resource requirement. In this paper, we propose a dynamic, queuing method to smooth out the data processing without sacrificing the convergence performance of censoring. The proposed method entails simple, closed-form updates, and has no loss in terms of accuracy comparing to the original adaptive censoring method.Simulation…

Twitter data analysis and visualizations using the R language on top of the Hadoop platform

Hadoop
"Twitter data analysis and visualizations using the R language on top of the Hadoop platform The main objective of the work presented within this paper was to design and implement the system for twitter data analysis and visualization in R environment using the big data processing technologies. Our focus was to leverage existing big data processing frameworks with its storage and computational capabilities to support the analytical functions implemented in R language. We decided to build the backend on top of the Apache Hadoop framework including the Hadoop HDFS as a distributed filesystem and MapReduce as a distributed computation paradigm. RHadoop packages were then used to connect the R environment to the processing layer and to design and implement the analytical functions in a distributed manner. Visualizations were implemented on…
Read More

Achieving Efficient and Privacy-Preserving Cross-Domain Big Data Deduplication in Cloud

Hadoop
"Achieving Efficient and Privacy-Preserving Cross-Domain Big Data Deduplication in Cloud Secure data deduplication can significantly reduce the communication and storage overheads in cloud storage services, and has potential applications in our big data-driven society. Existing data deduplication schemes are generally designed to either resist brute-force attacks or ensure the efficiency and data availability, but not both conditions. We are also not aware of any existing scheme that achieves accountability, in the sense of reducing duplicate information disclosure (e.g., to determine whether plaintexts of two encrypted messages are identical). In this paper, we investigate a three-tier cross-domain architecture, and propose an efficient and privacy-preserving big data deduplication in cloud storage (hereafter referred to as EPCDD). EPCDD achieves both privacy-preserving and data availability, and resists brute-force attacks. In addition, we take accountability…
Read More

QoS-Aware Data Replications and Placements for Query Evaluation of Big Data Analytics

Hadoop
"QoS-Aware Data Replications and Placements for Query Evaluation of Big Data Analytics Enterprise users at different geographic locations generate large-volume data and store their data at different geographic datacenters. These users may also issue ad hoc queries of big data analytics on the stored data to identify valuable information in order to help them make strategic decisions. However, it is well known that querying such large-volume big data usually is time-consuming and costly. Sometimes, users are only interested in timely approximate rather than exact query results. When this approximation is the case, applications must sacrifice either timeliness or accuracy by allowing either the latency of delivering more accurate results or the accuracy error of delivered results based on the samples of the data, rather than the entire set of data…
Read More

A Profile-Based Big Data Architecture for Agricultural Context

Hadoop
"A Profile-Based Big Data Architecture for Agricultural Context Bringing Big data technologies into agriculture presents a significant challenge; at the same time, this technology contributes effectively in many countries’ economic and social development. In this work, we will study environmental data provided by precision agriculture information technologies, which represents a crucial source of data in need of being wisely managed and analyzed with appropriate methods and tools in order to extract the meaningful information. Our main purpose through this paper is to propose an effective Big data architecture based on profiling system which can assist (among others) producers, consulting companies, public bodies and research laboratories to make better decisions by providing them real time data processing, and a dynamic big data service composition method, to enhance and monitor the agricultural…
Read More

Traffic-aware Task Placement with Guaranteed Job Completion Time for Geo-distributed Big Data

Hadoop
"Traffic-aware Task Placement with Guaranteed Job Completion Time for Geo-distributed Big Data Big data analysis is usually casted into parallel jobs running on geo-distributed data centers. Different from a single data center, geo-distributed environment imposes big challenges for big data analytics due to the limited network bandwidth between data centers located in different regions.Although research efforts have been devoted to geo-distributed big data, the results are still far from being efficient because of their suboptimal performance or high complexity. In this paper, we propose a traffic-aware task placement to minimize job completion time of big data jobs. We formulate the problem as a non-convex optimization problem and design an algorithm to solve it with proved performance gap. Finally, extensive simulations are conducted to evaluate the performance of our proposal. The…
Read More

Review Based Service Recommendation for Big Data

Hadoop
"Review Based Service Recommendation for Big Data Success of web 2.0 brings online information overload. An exponential growth of customers, services and online information has been observed in last decade. It yields big data investigation problem for service recommendation system. Traditional recommender systems often put up with scalability, lack of security and efficiency problems. Users preferences are almost ignored. So, the requirement of robust ecommendation system is enhanced now a days. In this paper, we present review based service recommendation to dynamically recommend services to the users. Keywords are extracted from passive users reviews and a rating value is given to every new keyword observed in the dataset. Sentiment analysis is performed on these rating values and top-k services recommendation list is provided to users. To make the system more…

Online Data Deduplication for In-Memory Big-Data Analytic Systems

Hadoop
"Online Data Deduplication for In-Memory Big-Data Analytic Systems Given a set of files that show a certain degree of similarity, we consider a novel problem of performing data redundancy elimination across a set of distributed worker nodes in a shared-nothing in-memory big data analytic system. The redundancy elimination scheme is designed in a manner that is: (i) space-efficient: the total space needed to store the files is minimized and, (ii) access-isolation: data shuffling among server is also minimized. In this paper, we first show that finding an access-efficient and space optimal solution is an NP-Hard problem. Following this, we present the file partitioning algorithms that locate access-efficient solutions in an incremental manner with minimal algorithm time complexity (polynomial time). Our experimental verification on multiple data sets confirms that the proposed…
Read More

Novel Common Vehicle Information Model (CVIM) for Future Automotive Vehicle Big Data Marketplaces

Hadoop
"Novel Common Vehicle Information Model (CVIM) for Future Automotive Vehicle Big Data Marketplaces Even though connectivity services have been introduced in many of the most recent car models, access to vehicle data is currently limited due to its proprietary nature. The European project AutoMat has therefore developed an open Marketplace providing a single point of access for brandindependent vehicle data. Thereby, vehicle sensor data can be leveraged for the design and implementation of entirely new services even beyond traffic-related applications (such as hyperlocal traffic forecasts). This paper presents the architecture for a Vehicle Big Data Marketplace as enabler of cross-sectorial and innovative vehicle data services. Therefore, the novel Common Vehicle Information Model (CVIM) is defined as an open and harmonized data model, allowing the aggregation of brandindependent and generic data…

Holistic Perspective of Big Data in Healthcare

Hadoop
"Holistic Perspective of Big Data in Healthcare Healthcare has increased its overall value by adopting big data methods to analyze and understand its data from various sources. This article presents big data from the perspective of improving healthcare services and, also, offers a holistic view of system security and factors determining security breaches.
Read More

Focusing on a Probability Element: Parameter Selection of Message Importance Measure in Big Data

Hadoop
"Focusing on a Probability Element: Parameter Selection of Message Importance Measure in Big Data Message importance measure (MIM) is applicable to characterize the importance of information in the scenario of big data, similar to entropy in information theory. In fact, MIM with a variable parameter can make an effect on the characterization of distribution. Furthermore, by choosing an appropriate parameter of MIM, it is possible to emphasize the message importance of a certain probability element in a distribution. Therefore, parametric MIM can play a vital role in anomaly detection of big data by focusing on probability of an anomalous event. In this paper, we propose a parameter selection method of MIM focusing on a probability element and then present its major properties. In addition, we discuss the parameter selection with…