Sunday, July 21, 2019
Data Anonymization in Cloud Computing
Data Anonymization Approach for Privacy Preserving in Cloud

Saranya M

Abstract—Private data such as electronic health records and banking transactions must be shared within the cloud environment so that it can be analyzed or mined for research purposes. Data privacy is one of the most pressing concerns in big data applications, because processing large-scale sensitive data sets often requires the computation power of public cloud services. With a technique called data anonymization, the privacy of an individual can be preserved while aggregate information is shared for mining purposes. Data anonymization means hiding the sensitive data items of the data owner. A bottom-up generalization transforms more specific data into less specific but semantically consistent data for privacy protection. The idea is to use data generalization from data mining to hide detailed data rather than to discover patterns. Once the data is masked, data mining techniques can be applied without modification.

Keywords—Data Anonymization; Cloud; Bottom-Up Generalization; MapReduce; Privacy Preservation.

I. INTRODUCTION

Cloud computing refers to configuring, manipulating, and accessing applications online. It provides online data storage, infrastructure, and applications, and is a disruptive trend with a significant impact on the current IT industry and research communities [1]. Cloud computing provides massive storage capacity and computation power by harnessing large numbers of commodity computers. It enables users to deploy applications at low cost, without heavy upfront investment in infrastructure.
Due to privacy and security problems, numerous potential customers are still hesitant to take advantage of the cloud [7]. Cloud computing reduces costs through optimization and increased operating and economic efficiency, and it enhances collaboration, agility, and scale by enabling a global computing model over the Internet infrastructure. However, without proper security and privacy solutions, this promising cloud computing paradigm could become a huge failure.

Cloud delivery models are classified into three kinds: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS). SaaS is very similar to the old thin-client model of software provision, where clients, usually web browsers, provide the point of access to software running on servers. PaaS provides a platform on which software can be developed and deployed. IaaS comprises highly automated and scalable computing resources, complemented by cloud storage and network capability, which can be metered, self-provisioned, and made available on demand [7].

Cloud is deployed using models that include public, private, and hybrid clouds. A public cloud is one in which the services and infrastructure are provided off-site over the Internet. A private cloud is one in which the services and infrastructure are maintained on a private network; such clouds offer a greater level of security. A hybrid cloud includes a variety of public and private options with multiple providers.

Big data environments require clusters of servers to support the tools that process large volumes of fast-arriving data in varied formats. Clouds are deployed on pools of server, storage, and networking resources and can scale up or down as needed. Cloud computing thus provides a cost-effective way to support big data techniques and the advanced applications that drive business value.
Big data analytics is a set of advanced technologies designed to work with large volumes of data. It uses quantitative methods such as computational mathematics, machine learning, robotics, neural networks, and artificial intelligence to explore the data in the cloud. Analyzing big data on cloud infrastructure makes sense because investments in big data analysis can be significant and drive a need for efficient, cost-effective infrastructure, and because big data combines internal and external sources as well as the data services needed to extract value from it [17].

To address the scalability problem for large-scale data sets, a widely adopted parallel data processing framework such as MapReduce is used. In the first phase, the original data sets are partitioned into groups of smaller data sets, and these data sets are anonymized in parallel, producing intermediate results. In the second phase, the intermediate results are integrated into one data set and further anonymized to achieve a consistent k-anonymous data set.

MapReduce is a programming and implementation model for processing and generating large data sets. A map function processes a key-value pair and generates a set of intermediate key-value pairs; a reduce function merges all intermediate values associated with the same intermediate key.

II. RELATED WORK

Ke Wang, Philip S. Yu, and Sourav Chakraborty adopt a bottom-up generalization approach that works iteratively to generalize the data. The generalized data is useful for classification, yet difficult to link to other sources. A hierarchical structure of generalizations specifies the generalization space; identifying the best generalization with which to climb up the hierarchy at each iteration is the key [2]. Benjamin C. M.
Fung and Ke Wang discuss that privacy-preserving technology solves only some of the problems; it is also important to identify and overcome the nontechnical difficulties faced by decision makers when deploying a privacy-preserving technology. Their concerns include the degradation of data quality, increased costs, increased complexity, and loss of valuable information. They argue that cross-disciplinary research is the key to removing these obstacles and urge scientists in the privacy protection field to conduct cross-disciplinary research with social scientists in sociology, psychology, and public policy studies [3].

Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong proposed two classification-aware data anonymization methods that combine local value suppression with global attribute generalization. The attribute generalization is determined by the data distribution rather than by the privacy requirement, and generalization levels are optimized by normalized mutual information to preserve classification capability [17].

Xiaokui Xiao and Yufei Tao present a technique called anatomy for publishing sensitive data sets. Anatomy releases all the quasi-identifier and sensitive data items directly in two separate tables. Combined with a grouping mechanism, this approach protects privacy while capturing a large amount of correlation in the microdata. A linear-time algorithm for computing anatomized tables that obey the l-diversity privacy requirement is developed, which minimizes the error of reconstructing the microdata [13].

III. PROBLEM ANALYSIS

The centralized Top-Down Specialization (TDS) approach exploits a data structure to improve scalability and efficiency by indexing anonymous data records.
However, overhead may be incurred by maintaining the linkage structure and updating the statistical information when data sets become large, so centralized approaches are likely to suffer from low efficiency and scalability when handling large-scale data sets. A distributed TDS approach has been proposed to address the anonymization problem in distributed systems, but it concentrates on privacy protection rather than scalability, and it employs information gain only, not privacy loss [1].

Indexing data structures speed up the anonymization and generalization process because they avoid repeatedly scanning the whole data set [15]. These approaches fail to work in parallel or distributed environments such as cloud systems, since the indexing structures are centralized. Centralized approaches have difficulty handling large-scale data sets well on the cloud using just one single VM, even if the VM has the highest computation and storage capability.

Fung et al. proposed a TDS approach that produces an anonymized data set while addressing the exploration problem on the data. A data structure called taxonomy indexed partition (TIPS) is exploited to improve the efficiency of TDS, but it fails to handle large data sets. Because the approach is centralized, it is inadequate for large data sets.

Raj H, Nathuji R, Singh A, and England P propose cache-hierarchy-aware core assignment and page-coloring-based cache partitioning to provide resource isolation and better resource management, thereby guaranteeing the security of data during processing. However, the page coloring approach degrades performance when a VM's working set does not fit in its cache partition [14].

Ke Wang and Philip S. Yu consider the following problem: a data holder needs to release a version of the data for building classification models, but wants to protect sensitive information against linkage with external sources.
They therefore adapt the iterative bottom-up generalization approach from data mining to generalize the data.

IV. METHODOLOGY

Suppression: In this method, certain values of the attributes are replaced by an asterisk (*). All or some values of a column may be replaced by *.

Generalization: In this method, individual values of attributes are replaced with a broader category. For example, the value 19 of the attribute Age may be replaced by ≤ 20, and the value 23 by > 20.

A. Bottom-Up Generalization

Bottom-Up Generalization (BUG) is one of the efficient k-anonymization methods. Under k-anonymity, attributes are suppressed or generalized until each row is identical with at least k-1 other rows; the database is then said to be k-anonymous. The BUG approach to anonymization starts from the lowest anonymization level and proceeds iteratively, using the information/privacy trade-off as the search metric. Bottom-Up Generalization and the MapReduce Bottom-Up Generalization (MRBUG) Driver are used. The steps of the Advanced BUG are: partition the data, run the MRBUG Driver on each partition, combine the anonymization levels of the partitioned data items, and then apply generalization to the original data set without violating k-anonymity.

Fig. 1 System architecture of the bottom-up approach

Here an Advanced Bottom-Up Generalization approach is presented, which improves the scalability and performance of BUG. Two levels of parallelization are provided by MapReduce (MR) on the cloud environment. The first is job-level parallelization, meaning multiple MR jobs can execute simultaneously, making full use of the cloud infrastructure. The second is task-level parallelization, meaning multiple mapper or reducer tasks in an MR job execute simultaneously on data partitions.
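As a concrete illustration of the two masking operations and the k-anonymity condition they aim to reach, the following is a minimal Python sketch; the attribute names, the age intervals, and the helper functions are illustrative assumptions, not the paper's implementation:

```python
from collections import Counter

def suppress(value):
    """Suppression: replace an attribute value with an asterisk."""
    return "*"

def generalize_age(age):
    """Generalization: replace an exact age with a broader category,
    e.g. 19 -> '<=20' as in the example above (intervals are assumed)."""
    return "<=20" if age <= 20 else ">20"

def is_k_anonymous(records, quasi_ids, k):
    """A table is k-anonymous when every combination of
    quasi-identifier values occurs in at least k rows."""
    groups = Counter(tuple(r[a] for a in quasi_ids) for r in records)
    return all(count >= k for count in groups.values())

raw = [{"age": 19, "zip": "53711"},
       {"age": 18, "zip": "53712"},
       {"age": 23, "zip": "53715"},
       {"age": 27, "zip": "53703"}]

# Mask the quasi-identifiers: generalize Age, suppress Zip.
masked = [{"age": generalize_age(r["age"]), "zip": suppress(r["zip"])}
          for r in raw]

print(is_k_anonymous(raw, ["age", "zip"], 2))     # -> False (rows unique)
print(is_k_anonymous(masked, ["age", "zip"], 2))  # -> True
```

Each raw row is unique on (Age, Zip) and thus re-identifiable, while after masking every quasi-identifier combination occurs at least twice.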
The following steps are performed in our approach. First, the data sets are split into smaller data sets using several job-level MapReduce jobs, and the partitioned data sets are anonymized by the Bottom-Up Generalization Driver. The intermediate anonymization levels so obtained are then integrated into one, ensuring that the integrated intermediate level never violates the k-anonymity property. After the merged intermediate anonymized data set is obtained, the driver is executed on the original data set to produce the resultant anonymization level. The algorithm for Advanced Bottom-Up Generalization [15] is given below; in the ith iteration, it generalizes R by the best generalization Gbest.

B. MapReduce

The MapReduce framework is built on map and reduce functions. Map is a function that parcels out work to the different nodes in the distributed cluster; Reduce is a function that collates the work and resolves the results into a single value.

Fig. 2 MapReduce framework

The MR framework is fault-tolerant because each node in the cluster must report back periodically with status updates and completed work. For example, if a node remains silent for longer than expected, a master node notes it and reassigns that task to other nodes. A single MR job is inadequate to accomplish the task, so a group of MR jobs is orchestrated in one MR driver. The MR framework consists of an MR Driver and two types of jobs: IGPL Initialization and IGPL Update. The MR driver arranges the execution of the jobs. Hadoop provides a mechanism for setting global variables for the mappers and reducers; the best specialization is passed into the map function of the IGPL Update job. In the bottom-up approach, the data is first initialized to its current state, and then the generalization process is carried out so that k-anonymity is not violated.
That is, we have to climb the taxonomy tree of the attribute until the required anonymity is achieved.

1: while R does not satisfy the anonymity requirement do
2:   for all generalizations G do
3:     compute IP(G);
4:   end for;
5:   find the best generalization Gbest;
6:   generalize R through Gbest;
7: end while;
8: output R;

V. EXPERIMENT EVALUATION

The aim is to use data generalization from data mining to hide detailed information, rather than to discover patterns and trends. Once the data has been masked, all the standard data mining techniques can be applied without modification. Here the data mining technique not only discovers useful patterns but also masks the private information.

Fig. 3 Change of execution time of TDS and BUG

Fig. 3 shows the change in execution time of the TDS and BUG algorithms. We compared the execution time of TDS and BUG for EHR sizes ranging from 50 to 500 MB, keeping p = 1. The bottom-up generalization transforms specific data into less specific data, focusing on the key issues of quality and scalability. Quality is addressed by the information/privacy trade-off of the bottom-up generalization approach; scalability is addressed by a novel data structure that focuses generalizations. To evaluate the efficiency and effectiveness of the BUG approach, we compare BUG with TDS. Experiments were performed in a cloud environment, with both approaches implemented in Java using the standard Hadoop MapReduce API.

VI. CONCLUSION

Here we studied the scalability problem of anonymizing data on the cloud for big data applications using Bottom-Up Generalization and proposed a scalable Bottom-Up Generalization approach. The BUG approach proceeds as follows: first the data is partitioned and the driver is executed, producing intermediate results; these results are then merged into one, and a generalization approach is applied. This produces the anonymized data.
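The generalization loop shown in the pseudocode above can be rendered as a minimal single-machine sketch in Python. The one-attribute taxonomy tree and the use of group rarity as a stand-in for the IP(G) search metric are simplifying assumptions; the paper's metric trades off information against privacy:

```python
from collections import Counter

# One-attribute taxonomy tree (assumed): each value maps to its
# parent, the next less specific level; '*' is the root.
PARENT = {"19": "<=20", "18": "<=20", "23": "21-40", "35": "21-40",
          "<=20": "*", "21-40": "*", "*": "*"}

def satisfies_k(values, k):
    """Anonymity requirement: every current value occurs >= k times."""
    return all(c >= k for c in Counter(values).values())

def bottom_up_generalize(values, k):
    """Climb the taxonomy tree until the anonymity requirement holds."""
    while not satisfies_k(values, k):
        counts = Counter(values)
        # Candidate generalizations: values that can still be climbed.
        candidates = [v for v in counts if PARENT[v] != v]
        # Stand-in for 'find best generalization Gbest': pick the value
        # whose group is rarest, since it blocks k-anonymity the most.
        gbest = min(candidates, key=lambda v: counts[v])
        values = [PARENT[v] if v == gbest else v for v in values]
    return values

ages = ["19", "23", "35", "35"]
print(bottom_up_generalize(ages, 2))  # -> ['*', '*', '35', '35']
```

Note that the already-frequent value "35" stays at its specific level, while the rare values are climbed upward until every group has at least k members.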
The data anonymization is done using the MR framework on the cloud. The results show that scalability and efficiency are improved significantly over existing approaches.

REFERENCES

[1] Xuyun Zhang, Laurence T. Yang, Chang Liu, and Jinjun Chen, "A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud," vol. 25, no. 2, February 2014.
[2] Ke Wang, P.S. Yu, and S. Chakraborty, "Bottom-Up Generalization: A Data Mining Solution to Privacy Protection."
[3] B.C.M. Fung, K. Wang, R. Chen, and P.S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Comput. Surv., vol. 42, no. 4, pp. 1-53, 2010.
[4] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Workload-Aware Anonymization Techniques for Large-Scale Datasets," ACM Trans. Database Syst., vol. 33, no. 3, pp. 1-47, 2008.
[5] B. Fung, K. Wang, L. Wang, and P.C.K. Hung, "Privacy-Preserving Data Publishing for Cluster Analysis," Data Knowl. Eng., vol. 68, no. 6, pp. 552-575, 2009.
[6] B.C.M. Fung, K. Wang, and P.S. Yu, "Anonymizing Classification Data for Privacy Preservation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 5, pp. 711-725, May 2007.
[7] Hassan Takabi, James B.D. Joshi, and Gail-Joon Ahn, "Security and Privacy Challenges in Cloud Computing Environments."
[8] K. LeFevre, D.J. DeWitt, and R. Ramakrishnan, "Incognito: Efficient Full-Domain K-Anonymity," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '05), pp. 49-60, 2005.
[9] T. Iwuchukwu and J.F. Naughton, "K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), pp. 746-757, 2007.
[10] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, pp. 107-113, 2008.
[11] J. Dean and S. Ghemawat, "MapReduce: A Flexible Data Processing Tool," Comm. ACM, vol. 53, no. 1, pp. 72-77, 2010, DOI: 10.1145/1629175.1629198.
[12] Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong, "Information Based Data Anonymization for Classification Utility."
[13] X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. 32nd Int'l Conf. Very Large Data Bases (VLDB '06), pp. 139-150, 2006.
[14] Raj H, Nathuji R, Singh A, and England P, "Resource Management for Isolation Enhanced Cloud Services," Proc. 2009 ACM Workshop on Cloud Computing Security, Chicago, Illinois, USA, 2009, pp. 77-84.
[15] K.R. Pandilakshmi and G. Rashitha Banu, "An Advanced Bottom Up Generalization Approach for Big Data on Cloud," vol. 03, June 2014, pp. 1054-1059.
[16] Intel, "Big Data in the Cloud: Converging Technologies."
[17] Jiuyong Li, Jixue Liu, Muzammil Baig, and Raymond Chi-Wing Wong, "Information Based Data Anonymization for Classification Utility."