- Open Access
- Authors: Tamanda Mgwalima, Arthur Ndlovu
- Paper ID: IJERTV11IS120020
- Volume & Issue: Volume 11, Issue 12 (December 2022)
- Published (First Online): 12-12-2022
- ISSN (Online): 2278-0181
- Publisher Name: IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Review of the Past, Present, and Future of Secure Data Deduplication
Tamanda Mgwalima
School of Information Science and Technology, Harare Institute of Technology, Zimbabwe
Arthur Ndlovu
School of Information Science and Technology, Harare Institute of Technology, Zimbabwe
Abstract—In today's digital world, many enterprises, organizations and individuals have chosen to outsource their data to cloud storage providers to reduce the burden of maintaining enormous amounts of data. Storage optimization techniques have become an essential requirement in cloud storage, and many cloud storage providers perform deduplication, which avoids storing duplicate copies of data received from multiple users. Cloud subscribers do not rely on service providers for the security of their data: to ensure confidentiality, subscribers encrypt their data before outsourcing it to the cloud. The problem is that deduplication cannot be applied directly to encrypted data, since encrypting the same data under different keys (by different subscribers) yields different ciphertexts, preventing the Cloud Service Provider (CSP) from detecting duplicates. Performing deduplication over encrypted data securely in the cloud is therefore a challenging task. Various secure deduplication methods have been proposed to overcome this challenge, and in this paper we review these state-of-the-art methods. Finally, we outline future research directions for deduplication-based storage systems.
Keywords—Cloud Storage, Deduplication, Proof of Ownership, Convergent Encryption
INTRODUCTION
According to NIST, cloud computing is "a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [1].
The cloud allows users to store enormous amounts of data, which can be retrieved as and when required. Google Drive [2], SugarSync [3], OpenDrive [4] and Amazon S3 [5] are a few examples of cloud storage offerings. In general, Cloud Service Providers (CSPs) store a single copy of identical data received from multiple sources to optimize space. However, CSPs cannot recognize identical data when clients upload it in encrypted form using different keys. Encryption is essential to ensure the confidentiality of data, while deduplication is essential for achieving optimized storage; hence, deduplication and encryption need to work hand in hand to ensure both data confidentiality and optimized storage. Various techniques and approaches for deduplication over encrypted data are studied in this paper.
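To make the conflict concrete, the minimal sketch below (in Python, assuming the third-party `cryptography` package; the file contents are illustrative placeholders) shows that when two subscribers encrypt the same file under independent random keys, the CSP receives two unrelated-looking ciphertexts and can no longer detect the duplicate.

```python
# Sketch: why per-user keys defeat deduplication.
# Assumes the 'cryptography' package (pip install cryptography).
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

file_contents = b"...identical file bytes uploaded by two subscribers..."

# Each subscriber encrypts the *same* file under an independent random key.
key_a = AESGCM.generate_key(bit_length=256)
key_b = AESGCM.generate_key(bit_length=256)
ct_a = AESGCM(key_a).encrypt(os.urandom(12), file_contents, None)
ct_b = AESGCM(key_b).encrypt(os.urandom(12), file_contents, None)

# The CSP sees only the ciphertexts; identical plaintexts now look like
# unrelated data, so duplicate detection fails.
assert ct_a != ct_b
```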
LITERATURE REVIEW OF SECURE DATA DEDUPLICATION METHODS
Douceur et al. [6] first provided a solution for deduplication over encrypted data, aiming to achieve both data confidentiality and deduplication. In their scheme, convergent encryption was introduced to derive the encryption key from the hash of the plaintext. Two users with identical plaintexts will then obtain identical ciphertexts, since the encryption key is the same; hence the cloud storage provider can perform deduplication on such ciphertexts and keep a single instance in storage. However, since convergent encryption is a deterministic encryption scheme, it is vulnerable to brute-force dictionary attacks when the file comes from a predictable plaintext space. Since the proposal of convergent encryption, many other schemes for encrypted-data deduplication have appeared in the literature. Convergent encryption is also susceptible to further attacks, such as the confirmation-of-a-file attack and the learn-the-remaining-information attack. In a confirmation-of-a-file attack, anyone who owns a file can verify whether another user also possesses that same file.
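The following sketch (Python, again assuming the `cryptography` package; the plaintexts are illustrative) shows both the benefit and the weakness of convergent encryption: identical plaintexts encrypt to identical ciphertexts, yet an attacker can brute-force a predictable plaintext simply by encrypting candidates and comparing.

```python
# Sketch of convergent encryption [6]: key = H(M), so equal plaintexts
# yield equal ciphertexts and the CSP can deduplicate.
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

ZERO_NONCE = b"\x00" * 12  # deterministic by design; tolerable only because
                           # each convergent key encrypts exactly one message

def convergent_encrypt(plaintext: bytes) -> bytes:
    key = hashlib.sha256(plaintext).digest()  # K = H(M)
    return AESGCM(key).encrypt(ZERO_NONCE, plaintext, None)

m = b"identical file contents"
assert convergent_encrypt(m) == convergent_encrypt(m)  # deduplicable

# Brute-force dictionary attack: for a predictable plaintext space, the
# attacker encrypts every candidate and compares against the stored ciphertext.
target = convergent_encrypt(b"salary: 50000")
for guess in (b"salary: %d" % s for s in range(40000, 60000)):
    if convergent_encrypt(guess) == target:
        print("recovered plaintext:", guess.decode())
        break
```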
Bellare, Keelveedhi and Ristenpart [7] formalized Message-Locked Encryption (MLE), a cryptographic primitive in which the key used for encryption and decryption is derived from the message itself. However, MLE is likewise vulnerable to brute-force attacks. Bellare et al. subsequently proposed DupLESS [8] to resist these attacks: the scheme introduces a key server and implements a rate-limiting strategy.
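A simplified sketch of this idea follows. Note that the real DupLESS protocol derives keys through an oblivious PRF (RSA blind signatures), so the key server never learns the file hash; the plain HMAC below is a deliberate simplification used only to show the server-held secret and the rate limiting, and the class, names and limits are illustrative assumptions.

```python
# Simplified sketch of the DupLESS idea [8]: keys depend on a secret held by a
# key server, so offline dictionary attacks become impossible and online ones
# are rate-limited. (Real DupLESS blinds the request via an oblivious PRF.)
import hashlib, hmac, time

class KeyServer:
    def __init__(self, max_requests_per_minute=60):
        self._secret = b"server secret; never leaves the key server"
        self._limit = max_requests_per_minute
        self._window = {}  # client_id -> list of request timestamps

    def derive_key(self, client_id: str, file_hash: bytes) -> bytes:
        now = time.time()
        recent = [t for t in self._window.get(client_id, []) if now - t < 60]
        if len(recent) >= self._limit:
            raise RuntimeError("rate limit exceeded")  # throttles brute force
        self._window[client_id] = recent + [now]
        # PRF under the server secret: same file -> same key, for every client
        return hmac.new(self._secret, file_hash, hashlib.sha256).digest()

server = KeyServer()
k1 = server.derive_key("alice", hashlib.sha256(b"shared file").digest())
k2 = server.derive_key("bob", hashlib.sha256(b"shared file").digest())
assert k1 == k2  # identical files still get identical keys, so dedup works
```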
Chouhan, Peddoju and Buyya [9] propose dualDup, a secure and reliable cloud storage framework that deduplicates both the encrypted data and the keys. Extending the DupLESS concept, it optimizes storage by eliminating duplicate encrypted data from multiple users, and it securely distributes data and key fragments using an erasure coding scheme to achieve privacy and reliability.
Halevi, Harnik, Pinkas, and Shulman-Peleg [10] addressed a security problem in client-side deduplication: when an entire file is represented by a hash value, an adversary who obtains this hash can claim ownership of the file. They proposed the concept of Proofs of Ownership (PoW) based on a Merkle tree. The cloud server computes the Merkle tree of the uploaded file and stores its root hash; clients then prove ownership by computing the sibling paths for the leaves the server challenges. If the client computes the requested paths correctly, the cloud server accepts that it is in possession of the file. However, this scheme has a large computational overhead. Following [10], Pietro and Sorniotti [11] propose a more efficient PoW scheme that uses the projection of a file onto randomly selected bit positions as the file proof. They addressed a major security risk in existing PoW schemes, where an adversary in possession of only a fraction of the original file could claim possession of the entire file.
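A minimal sketch of the Merkle-tree challenge-response follows (Python; the function names, the power-of-two block count and the single-leaf challenge are illustrative simplifications of the real multi-leaf protocol):

```python
# Sketch of Merkle-tree Proofs of Ownership [10]: the server keeps only the
# root; a client proves possession by answering leaf challenges with paths.
import hashlib, os

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(blocks):
    """All tree levels, leaves first (assumes len(blocks) is a power of 2)."""
    levels = [[h(b) for b in blocks]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def prove(blocks, index):
    """Client side: sibling path from the challenged leaf up to the root."""
    levels, path = build_levels(blocks), []
    for level in levels[:-1]:
        path.append(level[index ^ 1])  # sibling node at this level
        index //= 2
    return path

def verify(root, block, index, path):
    """Server side: recompute the root from the challenged block and path."""
    node = h(block)
    for sibling in path:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

blocks = [os.urandom(64) for _ in range(8)]  # 8 fixed-size file blocks
root = build_levels(blocks)[-1][0]           # the server stores only this
challenge = 5                                # server picks a random leaf index
assert verify(root, blocks[challenge], challenge, prove(blocks, challenge))
```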
Authors in [12] propose a flexible and scalable PoW scheme based on Bloom filters. This scheme is more efficient at the client side than the approach in [10] and more efficient at the server side than the scheme in [11].
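The sketch below illustrates the Bloom-filter mechanism in the spirit of [12] (Python; the filter parameters, token construction and challenge count are illustrative assumptions, not the paper's exact construction):

```python
# Sketch of a Bloom-filter PoW in the spirit of [12]: at first upload the
# server inserts per-block tokens into a Bloom filter; later claimants are
# challenged to produce tokens for randomly chosen block positions.
import hashlib, random

class BloomFilter:
    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size, self.k = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        for i in range(self.k):
            d = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(d[:8], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def contains(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def token(block: bytes, index: int) -> bytes:
    return hashlib.sha256(index.to_bytes(8, "big") + block).digest()  # position-bound

blocks = [b"block-%d-contents" % i for i in range(100)]
bf = BloomFilter()
for i, blk in enumerate(blocks):             # first uploader populates the filter
    bf.add(token(blk, i))

challenges = random.sample(range(100), 10)   # server challenges 10 positions
assert all(bf.contains(token(blocks[i], i)) for i in challenges)
```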
Li et al. [13] first addressed the problem of achieving efficient and reliable key management when implementing convergent encryption. They proposed a baseline approach in which a user encrypts each file with a convergent key and uses a unique, independent master key to encrypt the convergent keys; the encrypted data is then outsourced to the cloud server for storage. This approach generates a huge number of keys as the number of users grows, and hence requires an enormous amount of key storage space. Another weakness is that if the master key is compromised, the stored data cannot be recovered: the master key is a single point of failure. To overcome this, they also proposed Dekey, a new construction based on a ramp secret sharing scheme in which users do not manage any keys on their own but instead securely distribute convergent key shares across multiple servers. Singh et al. [14] also proposed an approach to eliminate this single point of failure by splitting the convergent keys into multiple random-looking shares using Chinese Remainder Theorem based secret sharing and sending the shares to multiple key management servers. The key can be recovered once a threshold number of shares is obtained from the key management servers after executing PoW protocols. In their scheme, the encrypted data itself is also distributed across multiple servers as random-looking shares based on the permutation ordered binary number system.
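To illustrate threshold sharing of a convergent key, here is a Shamir-style (t, n) sketch (Python). It is an analogous construction only: Dekey [13] actually uses a ramp scheme and [14] uses CRT-based sharing, and the prime and parameters below are illustrative.

```python
# Sketch: splitting a convergent key across n key servers so that any t
# shares recover it and fewer reveal nothing (Shamir-style threshold sharing).
import secrets

PRIME = 2**521 - 1  # a Mersenne prime larger than any 256-bit key

def split(key_int, t, n):
    coeffs = [key_int] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    def f(x):  # evaluate the degree-(t-1) polynomial at x
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    # Lagrange interpolation of the polynomial at x = 0
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num = den = 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret

convergent_key = int.from_bytes(secrets.token_bytes(32), "big")
shares = split(convergent_key, t=3, n=5)       # spread over 5 key servers
assert recover(shares[:3]) == convergent_key   # any 3 shares suffice
assert recover(shares[1:4]) == convergent_key
```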
Puzio et al. [15] proposed ClouDedup, in which, in addition to the basic storage provider, a metadata manager and an additional server are defined. The server adds an extra encryption layer to prevent well-known attacks against convergent encryption and thus protect the confidentiality of the data; the metadata manager, in turn, is responsible for key management, since block-level deduplication requires storing a huge number of keys.
PoW schemes work well when the file is in plaintext; however, the privacy of the client's data may then be vulnerable to honest-but-curious servers. To deal with this, clients tend to encrypt files before outsourcing them to the cloud, which makes the existing PoW schemes inapplicable. Yang et al. [16] first proposed a secure zero-knowledge based client-side deduplication scheme over encrypted files. The scheme achieves a high probability of detecting client misbehavior. They introduced a proxy re-encryption based key distribution scheme which ensures that the server learns nothing about the encryption key even though it acts as a proxy to help distribute the file encryption key. It also enables clients who have gained ownership of a file to share the generated encryption key without establishing secure channels among themselves. A client's private key cannot be recovered by the server or through client collusion attacks during the key distribution phase.
Authors in [17] propose a secure and efficient client-side encrypted deduplication scheme (CSED) built upon message-locked encryption (MLE), where a dedicated key server assists clients in generating MLE keys without learning any information about the data being encrypted. They integrate a Bloom filter based proof of ownership (PoW) mechanism into CSED to resist illegal content distribution attacks.
Most secure deduplication schemes presume that all files need equal security; however, Stanek and Kencl [18] propose a scheme which protects data according to its popularity. Data owned by many cloud subscribers is known as popular data, and data owned by few subscribers is called unpopular data. Semantic security is provided for unpopular files, while convergent encryption is used for popular files. Authors in [19] proposed PerfectDedup to counter the weaknesses of convergent encryption by taking into account the popularity of data segments. PerfectDedup takes advantage of the properties of perfect hashing to assure block-level deduplication and data confidentiality simultaneously.
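A sketch of the popularity-switch policy follows (Python, reusing the `cryptography` package). The threshold, the naive plaintext-side counting and the omitted unpopular-to-popular re-encryption step are simplifications: in [18] popularity is tracked obliviously by a trusted index rather than by the uploader.

```python
# Sketch of popularity-based protection [18][19]: files owned by fewer than
# THRESHOLD users stay semantically secure (random keys, no dedup); once a
# file becomes popular, it switches to convergent encryption so the CSP can
# deduplicate it.
import collections, hashlib, os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

THRESHOLD = 3                                  # illustrative popularity cut-off
owner_counts = collections.defaultdict(int)

def protect(plaintext: bytes) -> bytes:
    """Upload-side policy sketch: returns the ciphertext to outsource."""
    file_id = hashlib.sha256(b"id" + plaintext).digest()
    owner_counts[file_id] += 1
    if owner_counts[file_id] >= THRESHOLD:
        key = hashlib.sha256(plaintext).digest()     # convergent: deduplicable
        nonce = b"\x00" * 12                         # deterministic on purpose
    else:
        key, nonce = os.urandom(32), os.urandom(12)  # semantically secure
    return AESGCM(key).encrypt(nonce, plaintext, None)

uploads = [protect(b"widely shared file") for _ in range(4)]
assert uploads[0] != uploads[1]   # while unpopular: ciphertexts differ
assert uploads[2] == uploads[3]   # once popular: identical, deduplicable
```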
Guo, Jiang, Choo and Jie [20] proposed R-Dedup, a randomized, secure, client-side deduplication scheme that relies neither on an external third party nor on assistance from other peer users. By sharing a random value used to generate the encryption key among users who hold the same copy of a file, R-Dedup can resist brute-force attacks from both malicious cloud servers and subscribers. Data verification in R-Dedup ensures data integrity and provides user authentication for the cloud server. Less computation overhead occurs at the client side, since the complex calculations are handled by the cloud server.
Authors in [21] proposed DedupDUM, a deduplication scheme with dynamic user management, which updates dynamic group users in a secure way and restricts unauthorized cloud users from accessing sensitive data owned by valid users. Their scheme supports user revocation and the joining of new cloud users by exploiting re-encryption techniques, and it does not require a fully trusted third party.

Geeta C. M. et al. [22] proposed secure deduplication and virtual auditing of data in the cloud (SDVADC). SDVADC supports secure deduplication of information and effective virtual auditing of files during the download process, lowering the burden on the data owner of auditing files himself or through a third-party auditor. They proposed an algorithm for file uploading and virtual auditing that allows a user to encrypt a file using randomized convergent encryption and outsource the ciphertext to the distributed server on the cloud service provider's premises. The metadata of the file is sent to the Virtual Auditing Entity (VAE), which holds the metadata of all files uploaded to the distributed server. The CSP accepts the file and checks for duplication: if the file is original, the CSP saves it in storage; if it is a duplicate, the CSP runs the PoW protocol with the user, and once the user proves authorization, the CSP provides a link to the file already in storage. During download, the cloud user sends a file request to the CSP, the CSP forwards the requested file to the VAE, and the virtual auditing framework audits the file and returns it to the user together with an auditing report indicating whether the file has been modified.
Miguel, Mi and Aung [23] proposed HEDup, a scheme that performs deduplication on encrypted data with the aid of a key server deployed at the cloud service provider's premises. Subscribers obtain the data-encryption key from the key server through homomorphic searching operations.
Authors in [24] propose a cloud data deduplication scheme based on certificateless proxy re-encryption. Certificateless cryptography is applied to solve the key escrow problem and to avoid situations where a key generation centre impersonates a cloud subscriber to decrypt ciphertext. In this scheme, a convergent key is used to encrypt the plaintext; the resulting ciphertext is encrypted again using certificateless proxy re-encryption and stored in the cloud server together with the re-encryption key. The re-encryption key is used to share the encrypted data with other cloud subscribers, who are authenticated by a PoW scheme based on certificateless signatures.
Authors in [25] propose an identity-based proxy re-encryption scheme for cloud data deduplication. Unlike other solutions, which cannot support flexible data access control and require data users to remain online, this scheme integrates cloud deduplication with access control. It efficiently distributes ciphertexts and revokes data owners who delete their data from the cloud without requiring the data owners' participation, unlike certificateless proxy re-encryption.
COMPARATIVE ANALYSIS
Client-side deduplication has been applied in most existing schemes, as it offers more benefits than server-side deduplication. Table 1 gives the advantages and limitations of a few selected deduplication schemes and the methodologies behind them.
| Paper Title | Proposed Idea | Methodology | Advantages | Limitations |
|---|---|---|---|---|
| DupLESS: Server-Aided Encryption for Deduplicated Storage [8] | To resist brute-force attacks on Message-Locked Encryption. | Clients obtain message-derived keys with the help of a key server shared amongst a group of clients. Clients interact with the key server via a protocol for oblivious PRFs, ensuring that the key server can cryptographically mix secret material into the per-message keys while learning nothing about the files stored by clients. | Provides more security than convergent encryption. | When there is significantly less deduplication across the corpus, DupLESS may introduce greater overhead. |
| ClouDedup: Secure Deduplication with Encrypted Data for Cloud Storage [15] | To prevent well-known attacks against convergent encryption. | Secret keys can be generated in a hardware-dependent way by the device itself. Server encryption is applied on top of convergent encryption performed by the user. | Prevents curious cloud storage providers from inferring the original content of stored data by observing access patterns or accessing metadata. Gains in storage space are not affected by the overhead of metadata management, which is minimal. | If the third-party server is compromised, it may expose users to a man-in-the-middle attack. |
| Secure Data Deduplication in Cloud Storage Services, Doctoral Thesis [18] | Data can be differentiated based on popularity to determine the level of privacy offered. | Semantic security is provided for all unpopular data, whereas for popular data slightly weaker security (convergent encryption) is provided. | Users no longer need to manually classify sensitive files. The transition between unpopular and popular state is automatic and does not require active user participation. | Lower deduplication ratio. High computation cost. |
| CSED: Client-Side encrypted deduplication scheme based on proofs of ownership for cloud storage [17] | To construct a secure and efficient scheme that resists brute-force attacks and illegal content distribution attacks, where the adversary distributes data to other users via the cloud server. | Client-side encrypted deduplication based on proofs of ownership (PoW). A key server assists clients in generating MLE keys and adopts a rate-limiting strategy to prevent brute-force attacks. A Bloom filter and a hierarchical strategy on cloud storage improve efficiency. | Reduced storage and communication overhead. Secure against brute-force attacks. | No integrity auditing. |
| HEDup: Secure Deduplication with Homomorphic Encryption [23] | Deduplication on encrypted data with the aid of a key server deployed at the CSP premises; the client obtains the encryption key from the key server through homomorphic searching operations. | The subscriber encrypts data with a data-encryption key obtained from the key server via various key-management schemes, one of which uses homomorphic encryption. The key server deployed at the cloud provider premises deduplicates data not only from a particular domain but across the CSP's entire client base, including public and enterprise users. | Data uploads and downloads using HEDup have minor storage and latency overhead. Data owners maintain exclusive control of their data and data-encryption keys, i.e., the CSP has no access to any of it. | The key server may become a bottleneck as the number of clients increases in a large-scale deployment; a decentralized key-server deployment is suggested as a solution. |
| R-Dedup: Secure client-side deduplication for encrypted data without involving a third-party entity [20] | A randomized, secure, cross-user deduplication scheme that does not involve any third-party entity or require assistance from other users. | ElGamal encryption; a random value used to generate the encryption key is shared among users holding the same file. | Provides user authentication and data integrity. | Computation overhead on the client side. |
| Zero knowledge based client side deduplication for encrypted files of secure cloud storage in smart cities [16] | A secure zero-knowledge based client-side deduplication scheme over encrypted files. | The scheme enables a client to prove its file ownership via the original file without leaking any information to the server. The key distribution scheme is based on proxy re-encryption, which realizes delegation of decryption rights. | High probability of detecting client misbehaviour. | — |
| Secure Deduplication with Efficient and Reliable Convergent Key Management [13] | To efficiently and reliably manage a huge number of convergent keys. | The scheme implements a ramp secret sharing scheme in which users do not need to manage any keys on their own but instead securely distribute convergent key shares across multiple servers. | Incurs small encoding/decoding overhead compared to the network transmission overhead of regular upload/download operations. | No integrity auditing. |
| An identity-based proxy re-encryption for data deduplication in cloud [25] | To design a scheme that can flexibly support ciphertext distribution even when the data owner is offline. | Only one copy of the data is saved in the cloud for the initial uploader; for subsequent data owners who have passed the proof of ownership, the data is converted via proxy re-encryption into ciphertext that can be decrypted with their own private keys. | Flexibly supports ciphertext distribution even when the data owner is offline. | Slightly higher computation overhead due to re-encryption. |
| Authorized Client-Side Deduplication Using CP-ABE in Cloud Storage [26] | To allow only authorized users to access critical data, and to provide control over access permissions in encrypted deduplication storage. | Ciphertext-Policy Attribute-Based Encryption (CP-ABE). | Less burden on the authorization server and less storage overhead. | Time complexity. |
| SecDep: A User-Aware Efficient Fine-Grained Secure Deduplication Scheme with Multi-Level Key Management [27] | To resist brute-force attacks and convergent-key space overhead. | User-Aware Convergent Encryption (UACE) and Multi-Level Key management (MLK). | Time-efficient and key-space-efficient. | No integrity auditing or public verifiability. |

Table 1: Comparative analysis of deduplication techniques
CONCLUSION
Various secure deduplication techniques for cloud storage have been discussed. Client-side deduplication has comparatively more benefits than server-side deduplication and hence has been applied in most existing deduplication tools. This paper discussed the various secure data deduplication techniques and presented a comparative analysis of several existing client-side deduplication schemes. Future work includes designing a scheme capable of verifying the integrity of data without downloading it from the server, with reduced computational complexity, that supports insertion, deletion and update operations, private and public verifiability, batch auditing, and searchable encryption.
REFERENCES
[1] National Institute of Standards and Technology (NIST), "The NIST Definition of Cloud Computing," govinfo.gov, Jan. 2011. Accessed: Oct. 25, 2022. [Online]. Available: https://www.govinfo.gov/app/details/GOVPUB-C13-74cdc274b1109a7e1ead7185dfec2ada
[2] "My Drive – Google Drive." https://drive.google.com/drive/my-drive (accessed Nov. 28, 2022).
[3] "Cloud File Sharing, File Sync & Online Backup From Any Device | SugarSync." https://www1.sugarsync.com/ (accessed Nov. 28, 2022).
[4] G. Xu, M. Lai, J. Li, L. Sun, and X. Shi, "A generic integrity verification algorithm of version files for cloud deduplication data storage," 2018.
[5] "Cloud Object Storage – Amazon S3 – Amazon Web Services." https://aws.amazon.com/s3/ (accessed Nov. 28, 2022).
[6] J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer, "Reclaiming space from duplicate files in a serverless distributed file system," Proc. Int. Conf. Distrib. Comput. Syst., pp. 617–624, 2002, doi: 10.1109/ICDCS.2002.1022312.
[7] M. Bellare, S. Keelveedhi, and T. Ristenpart, "Message-Locked Encryption and Secure Deduplication," pp. 1–29, 2013.
[8] M. Bellare, S. Keelveedhi, and T. Ristenpart, "DupLESS: Server-Aided Encryption for Deduplicated Storage," 2013.
[9] V. Chouhan, S. K. Peddoju, and R. Buyya, "dualDup: A secure and reliable cloud storage framework to deduplicate the encrypted data and key," J. Inf. Secur. Appl., vol. 69, 2022.
[10] S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg, "Proofs of Ownership in Remote Storage Systems," pp. 1–13, 2013.
[11] R. Di Pietro and A. Sorniotti, "Boosting efficiency and security in proof of ownership for deduplication," 2012.
[12] J. Blasco, A. Orfila, R. Di Pietro, and A. Sorniotti, "A Tunable Proof of Ownership Scheme for Deduplication Using Bloom Filters," pp. 1–18, 2014.
[13] J. Li, X. Chen, M. Li, J. Li, P. P. C. Lee, and W. Lou, "Secure Deduplication with Efficient and Reliable Convergent Key Management," vol. 25, no. 6, pp. 1615–1625, 2014.
[14] P. Singh, N. Agarwal, and B. Raman, "Secure data deduplication using secret sharing schemes over cloud," Futur. Gener. Comput. Syst., vol. 88, pp. 156–167, 2018, doi: 10.1016/j.future.2018.04.097.
[15] P. Puzio and S. Loureiro, "ClouDedup: Secure Deduplication with Encrypted Data for Cloud Storage."
[16] C. Yang, M. Zhang, Q. Jiang, J. Zhang, and D. Li, "Zero knowledge based client side deduplication for encrypted files of secure cloud storage in smart cities," Pervasive Mob. Comput., vol. 41, pp. 243–258, 2017, doi: 10.1016/j.pmcj.2017.03.014.
[17] S. Li, C. Xu, and Y. Zhang, "CSED: Client-Side encrypted deduplication scheme based on proofs of ownership for cloud storage," J. Inf. Secur. Appl., vol. 46, pp. 250–258, 2019, doi: 10.1016/j.jisa.2019.03.015.
[18] J. Stanek, "Secure Data Deduplication in Cloud Storage Services," Doctoral Thesis, Mar. 2018.
[19] P. Puzio, R. Molva, M. Onen, and S. Loureiro, "PerfectDedup: Secure Data Deduplication," pp. 1–18.
[20] C. Guo, X. Jiang, K. R. Choo, and Y. Jie, "R-Dedup: Secure client-side deduplication for encrypted data without involving a third-party entity," J. Netw. Comput. Appl., vol. 162, p. 102664, 2020, doi: 10.1016/j.jnca.2020.102664.
[21] H. Yuan, X. Chen, T. Jiang, X. Zhang, and Y. Xiang, "DedupDUM: Secure and Scalable Data Deduplication With Dynamic User Management," Inf. Sci., vol. 456, 2018, doi: 10.1016/j.ins.2018.05.024.
[22] C. M. Geeta, S. R. R. G., and K. R. Venugopal, "SDVADC: Secure Deduplication and Virtual Auditing of Data in Cloud," Procedia Comput. Sci., vol. 171, pp. 2225–2234, 2020, doi: 10.1016/j.procs.2020.04.240.
[23] R. Miguel, K. Mi, and M. Aung, "HEDup: Secure Deduplication with Homomorphic Encryption," pp. 215–223, 2020.
[24] X. Zheng, Y. Zhou, Y. Ye, and F. Li, "A cloud data deduplication scheme based on certificateless proxy re-encryption," J. Syst. Archit., vol. 102, 2020.
[25] G. Kan, C. Jin, H. Zhu, Y. Xu, and N. Liu, "An identity-based proxy re-encryption for data deduplication in cloud," J. Syst. Archit., vol. 121, p. 102332, 2021, doi: 10.1016/j.sysarc.2021.102332.
[26] T. Youn, N. Jho, K. H. Rhee, and S. U. Shin, "Authorized Client-Side Deduplication Using CP-ABE in Cloud Storage," 2019.
[27] Y. Zhou et al., "SecDep: A User-Aware Efficient Fine-Grained Secure Deduplication Scheme with Multi-Level Key Management," 2015, doi: 10.1109/MSST.2015.720897.