
I have a Git server service built on go-git, and each repository has multiple copies on different machines. When a machine goes down and is replaced, I perform a full clone of the repository to generate a new copy, and I want to speed this process up. As the history grows, the full clone takes several hours, with the client spending a significant share of that time generating the index from the received pack.

I think there are two ways to speed up the copying phase:

  1. Skip the index generation on the client. Today the client generates the index from the received pack. I wonder if it's possible to transmit both the pack and the index to the client at the same time, so it saves the time the client takes to compute the index. However, I realize that Git's transfer protocols are all based on packs and do not include the index, so I'm not sure whether this approach would cause any issues.

  2. Truncate history, which would reduce the size of the repository and make the full clone faster. However, as a Git server I need to maintain the integrity of the repository. Shallow clones and partial clones are optimizations aimed at clients and are not feasible on the server side. Tools like git filter-repo rewrite history, which I cannot accept, because the commit IDs of some of the newer revisions are already relied upon by other systems.

I have been looking for a way to truncate history without rewriting the commit IDs, and it’s frustrating that I have yet to find one. For instance, git-replace only gives the illusion that the history is hidden, as the old historical objects cannot be cleaned up by git gc. Is there really no way out? Am I just supposed to watch the Git repository grow larger and larger?

  • You are trying to solve the unsolvable... modifying history (which is what truncating it amounts to) mandates creating new commit objects, which will unavoidably have new commit IDs. – eftshift0 Commented yesterday
  • If you found a way to do it, that would be the jackpot: you would have found a way to break the security of repos. Git's model, based on SHA-1 hashes (originally; I don't know how the migration to newer hashes has gone), is built to make this impossible, for security reasons. – eftshift0 Commented yesterday
  • Maybe you simply want shallow cloning with sparse checkout? – phd Commented yesterday
  • @eftshift0 A shallow clone can truncate history; however, as a Git server, I feel that I shouldn't maintain a repository that contains shallow files. As mentioned in github.blog/open-source/git/… , what are your thoughts on remote shallow repositories? – someone Commented 17 hours ago
  • ? Shallow clones do not rewrite history. You have a single commit object that points to a commit that you are missing. – eftshift0 Commented 15 hours ago

1 Answer

However, I realize that Git's transfer protocols are all based on packs and do not include the index

You're not limited to using Git's transfer protocols for the initial clone. You can wget a git bundle and clone from it locally, then update the 'remote' URL (current git versions have this workflow integrated as git clone --bundle-uri=). You can rsync a "template" clone, or you can do a local git init and rsync just the .git/objects directory (plus fetching the refs, plus minor tidying).
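
For concreteness, here is a minimal sketch of both approaches, assuming the plain git client is available on the replacement machine; every hostname and path (mirror.example.com, /srv/git/myrepo.git, and so on) is a placeholder, and the --bundle-uri shortcut needs git 2.38 or newer:

    # On a surviving copy: pack the entire repository into a single file.
    git -C /srv/git/myrepo.git bundle create /var/www/bundles/myrepo.bundle --all

    # On the replacement machine: download the bundle over plain HTTP,
    # clone from it locally, then point the remote back at the real server
    # and fetch whatever is newer than the bundle.
    wget https://mirror.example.com/bundles/myrepo.bundle
    git clone --mirror myrepo.bundle myrepo.git
    git -C myrepo.git remote set-url origin https://git.example.com/myrepo.git
    git -C myrepo.git fetch origin

    # Recent git (2.38+) can instead fetch the bundle as part of an ordinary clone.
    git clone --bundle-uri=https://mirror.example.com/bundles/myrepo.bundle \
        https://git.example.com/myrepo.git

    # Alternative: seed only the object store with rsync, then let a fetch
    # fill in the refs (it will find every object already present locally).
    git init --bare myrepo.git
    rsync -a git.example.com:/srv/git/myrepo.git/objects/ myrepo.git/objects/
    git -C myrepo.git fetch https://git.example.com/myrepo.git '+refs/*:refs/*'

Note that the rsync variant also carries over the existing .idx files, which is what addresses the index-generation time you mention.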

am I just supposed to watch the Git repository grow larger and larger?

Generally, yes.

Git is not quite meant for repositories of that size (the original "goal", linux.git, is currently at 10 GB, whereas it sounds from your description that yours is well above 100 GB), and specifically not meant for repositories that grow as rapidly as yours. If you're using Git as a way to sync build outputs, "Linux ISOs", or other binary data, it's not the right tool (git-annex, casync, Syncthing, or Ceph might be more appropriate).

I have been looking for a way to truncate history without rewriting the commit IDs, and it’s frustrating that I have yet to find one. For instance, git-replace only gives the illusion that the history is hidden

It is possible to clone from a shallow repository. If the repository being cloned from has a .git/shallow file – which tells Git to pretend that specific commits have no parents, without rewriting those commits outright – this is honored by git clone, which automatically makes a shallow repository in turn. So if you create an initial "template" shallow clone of the original, you'll have a repo that contains only the trimmed data, without having to manually git-gc the rest – and your machines can clone from this template.
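
A sketch of that template workflow, again with placeholder paths and an arbitrary depth, assuming the stock git client does the cloning (go-git's handling of shallow repositories may differ):

    # Build a shallow "template" next to the full repository, keeping only the
    # most recent 50000 commits of each branch. Use file:// (or --no-local),
    # otherwise --depth is ignored for local clones.
    git clone --bare --no-single-branch --depth=50000 \
        file:///srv/git/myrepo.git /srv/git/myrepo-template.git

    # The cut-off is recorded as plain commit IDs in the "shallow" file;
    # nothing above it has been rewritten.
    cat /srv/git/myrepo-template.git/shallow

    # Replacement machines clone the template and end up shallow themselves.
    git clone --bare https://git.example.com/myrepo-template.git myrepo.git
    git -C myrepo.git rev-parse --is-shallow-repository   # prints "true"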

(The more general 'grafts' feature has been superseded by git-replace, which does not transfer through clones; the more specific 'shallow' feature, however, deliberately keeps the original graft-style implementation.)

There is no other method that would allow this and be carried over by a git clone. Truly truncating a history at a certain commit would literally involve rewriting that commit to remove its parent XxXxX... line – which naturally changes its "commit ID", as the IDs are hashes of the whole commit, including all of its metadata (which is part of what ensures the integrity of a repository). This means the subsequent commits now need to be changed to reference the updated parent, so their IDs change, and so on, and so on. This is a deliberate part of Git's storage design.
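
That dependency is easy to see by hashing a commit's raw content yourself; this is purely an illustration you can run in any existing repository:

    # A commit object is plain text: tree, parent(s), author, committer, message.
    git cat-file commit HEAD

    # Hashing those exact bytes reproduces the commit ID...
    git cat-file commit HEAD | git hash-object -t commit --stdin

    # ...while dropping the "parent" line (i.e. truncating history at this
    # commit) yields a completely different ID.
    git cat-file commit HEAD | grep -v '^parent ' | git hash-object -t commit --stdin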
