
Re: copying many files



On Mon, Apr 21, 2008 at 10:27:08AM -0400, Tristan Lefebure wrote:
> Thanks all! 
> 
> Here are some answers to your questions (and sorry for the collective answer):
> 
> - I have several genome sequencing projects, with millions of reads (which makes 
> millions of files per project). Usually the reads are concatenated into a 
> single large file, but some annoying programs want them as individual files. 
> I store the raw data in a single folder on /home/lab/bk/name_of_the_project, 
> and I run some analysis somewhere else on the same disk 
> (actually /home/lab/nbk/name_of_the_project, which is not backed-up). Some 
> programs will modify the files, so I need to keep a "clean" copy, and I can't 
> just use links. So I guess that on the admin side of the world this would be 
> somewhat similar to someone who wants to duplicate millions of email files 
> from one place to another, but who also wants to do it several times per 
> week!

Tristan,

I don't think that you will necessarily improve performance by changing
your hardware configuration, especially given your equipment (20GB
RAM!!) and the difficult task you are asking the system to do.  I have
some questions for you.  The answers may produce the evidence you need
to improve performance by a lot, without changing your hardware:

1 When you deactivate journalling, doesn't performance improve a lot?

2 How much RAM is your system using to cache filesystem data blocks,
  directory entries and other metadata?

3 What is the system's policy for discarding items from the filesystem
  cache?

4 Does your filesystem store directory entries in an unordered array,
  in sorted order, or in some other data structure?

5 Are your system CPUs idle most of the time during the copy?  What state
  is cp ordinarily in?  Waiting for disk I/O?

6 What is the disk scheduler's policy?  C-SCAN?  First-come, first-served?

7 How long does the disk scheduler's work queue grow during the copy?
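
One rough way to answer the question about what state cp is in is to
ask ps for the process state and wait channel while the copy runs.
This is only a sketch: the helper name proc_state is made up for
illustration, and the `stat` and `wchan` output keywords are assumed
to be supported by your ps (they are on Linux and the BSDs, but they
are not required by POSIX).

```shell
# proc_state PID
# Print the scheduling state and wait channel of one process.
# A state that starts with "D" means uninterruptible sleep, which
# usually indicates that the process is waiting on disk I/O.
proc_state() {
    ps -o stat= -o wchan= -p "$1"
}
```

For example, start the copy in the background and run `proc_state $!`
a few times.  If cp is nearly always in state D, it is waiting on the
disk, not on the CPU.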

I suspect that your system unnecessarily spends most of its time
waiting for drive seeks to complete because of so-called "deceptive
idleness" (see <http://citeseer.ist.psu.edu/452277.html>).

After you have disabled journalling, see whether you can improve
performance by running several instances of cp that each copy a
different subset of your files.  For example, this command kicks off
several concurrent copies:

% cp src/a* dst/. & cp src/b* dst/. & cp src/y* dst/. & cp src/z* dst/. &

That will help the disk scheduler to schedule the cp processes (good),
instead of the cp processes scheduling the disk accesses (bad).
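
With millions of files, a glob such as src/a* may exceed the kernel's
argument-length limit, and file names may not be spread evenly across
first letters.  A possible variant of the same idea is to let xargs
split the file list into batches and keep a fixed number of cp
processes running.  This is a sketch under assumptions: the function
name parallel_copy and its default job and batch counts are
illustrative, and it relies on `find -maxdepth`/`-print0` and
`xargs -0`/`-P`, which GNU and BSD tools provide but POSIX does not
require.

```shell
# parallel_copy SRC DST [JOBS] [BATCH]
# Copy the regular files directly inside SRC into DST, running up to
# JOBS cp processes at once, each handed at most BATCH file names.
parallel_copy() {
    src=$1
    dst=$2
    jobs=${3:-4}      # assumed default: 4 concurrent cp processes
    batch=${4:-500}   # assumed default: 500 files per cp invocation
    mkdir -p "$dst" || return 1
    # NUL-separated names survive spaces and newlines in file names.
    # The inner sh receives DST as $1 and the batch of files after it;
    # it exits quietly if xargs ran it with no files at all.
    find "$src" -maxdepth 1 -type f -print0 |
        xargs -0 -n "$batch" -P "$jobs" \
            sh -c 'dst=$1; shift; [ "$#" -gt 0 ] || exit 0; cp -- "$@" "$dst"/' sh "$dst"
}
```

For example, `parallel_copy src dst 8 1000` keeps up to eight cp
processes busy.  Whether that beats a single cp depends on the disk
scheduler, so time both before settling on a job count.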

Dave

-- 
David Young             OJC Technologies
protected address      Urbana, IL * (217) 278-3933 ext 24