Re: copying many files
- From: David Young
- To: cslug-l
- Subject: Re: copying many files
- Date: Mon, 21 Apr 2008 12:22:41 -0500
On Mon, Apr 21, 2008 at 10:27:08AM -0400, Tristan Lefebure wrote:
> Thanks all!
>
> Here are some answers to your questions (and sorry for the collective answer):
>
> - I have several genome sequencing projects, with millions of reads (which makes
> millions of files per project). Usually the reads are concatenated into a
> single large file, but some annoying programs want them as individual files.
> I store the raw data in a single folder on /home/lab/bk/name_of_the_project,
> and I run some analysis somewhere else on the same disk
> (actually /home/lab/nbk/name_of_the_project, which is not backed-up). Some
> programs will modify the files, so I need to keep a "clean" copy, and I can't
> just use links. So I guess that on the admin side of the world this would be
> somewhat similar to someone who wants to duplicate millions of email files
> from one place to another, but who also wants to do it several times per
> week!
Tristan,
I don't think that you will necessarily improve performance by changing
your hardware configuration, especially given your equipment (20GB
RAM!!) and the difficult task you are asking the system to do. I have some
questions for you; the answers may produce the evidence you need to
improve performance a lot without changing your hardware:
1 When you deactivate journalling, doesn't performance improve a lot?
2 How much RAM is your system using to cache filesystem data blocks,
directory entries and other metadata?
3 What is the system's policy for discarding items from the filesystem
cache?
4 Does your filesystem store directory entries in an unordered array,
in sorted order, or in some other data structure?
5 Are your system CPUs idle most of the time during the copy? What state
is cp ordinarily in? Waiting for disk I/O?
6 What is the disk scheduler's policy? C-SCAN? First-come, first-served?
7 How long does the disk scheduler's work queue grow during the copy?
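Some of these questions can be answered from a shell while the copy is running. The following is only a sketch, assuming a Linux system with /proc and /sys mounted and the procps version of ps; the function name io_snapshot and the default device name sda are made up for illustration, so substitute the device that actually holds /home/lab.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): print a few Linux
# counters that bear on the questions above. Assumes Linux /proc and /sys.
io_snapshot() {
    dev=${1:-sda}   # assumption: name of the block device under /sys/block

    # Memory used for cached file data (Cached), block buffers (Buffers)
    # and kernel slab caches, which include directory entries and inodes.
    echo "== filesystem cache usage (kB) =="
    if [ -r /proc/meminfo ]; then
        grep -E '^(MemTotal|Buffers|Cached|Slab|SReclaimable):' /proc/meminfo || true
    else
        echo "/proc/meminfo not available"
    fi

    # The active disk scheduler is the name shown in [brackets].
    echo "== I/O scheduler for $dev =="
    if [ -r "/sys/block/$dev/queue/scheduler" ]; then
        cat "/sys/block/$dev/queue/scheduler"
    else
        echo "no scheduler file for $dev"
    fi

    # A STAT of "D" means uninterruptible sleep, usually waiting on disk I/O.
    echo "== state of running cp processes =="
    ps -o pid,stat,wchan,args -C cp 2>/dev/null || echo "no cp running"
}
```

Run it as `io_snapshot sda` in a second terminal during the copy. For CPU idle time and I/O wait, `vmstat 1` prints the `id` and `wa` columns once per second; if the sysstat package is installed, `iostat -x 1` also reports the average request queue length per device.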
I suspect that your system unnecessarily spends most of its time
waiting for drive seeks to complete because of so-called "deceptive
idleness," <http://citeseer.ist.psu.edu/452277.html>.
After you have disabled journalling, see whether you can improve performance
by running several instances of cp that each copy a different subset
of your files. For example, this command kicks off several concurrent
copies:
% cp src/a* dst/. & cp src/b* dst/. & cp src/y* dst/. & cp src/z* dst/. &
That will help the disk scheduler to schedule the cp processes (good),
instead of the cp processes scheduling the disk accesses (bad).
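With millions of files, a glob such as src/a* may expand to more arguments than the kernel accepts, and the first letters of the file names may not split the work evenly. Here is a sketch of the same idea using find and xargs. It assumes GNU findutils and GNU coreutils (for `xargs -P` and `-r`, and `cp -t`); the function name parallel_copy and its default job count and batch size are arbitrary choices for illustration, not tuned values.

```shell
#!/bin/sh
# Hypothetical helper (not from the original thread): copy the regular files
# directly under a source directory with several cp processes at once.
# Assumes GNU find, GNU xargs and GNU cp.
parallel_copy() {
    src=$1            # source directory
    dst=$2            # destination directory (created if missing)
    jobs=${3:-4}      # number of concurrent cp processes (arbitrary default)
    batch=${4:-500}   # files handed to each cp invocation (arbitrary default)

    mkdir -p "$dst" || return 1

    # find emits NUL-terminated names, so unusual file names are safe.
    # xargs splits them into batches and keeps up to $jobs cp processes
    # running; -r skips running cp at all when there are no files.
    find "$src" -maxdepth 1 -type f -print0 |
        xargs -0 -r -n "$batch" -P "$jobs" cp -t "$dst" --
}
```

For example, `parallel_copy /home/lab/bk/name_of_the_project /home/lab/nbk/name_of_the_project 8` would keep up to eight copies in flight. Try a few values for the job count and time each run, since the best number depends on the disk and the scheduler.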
Dave
--
David Young OJC Technologies
Urbana, IL * (217) 278-3933 ext 24