
Re: copying many files



Thanks all! 

Here are some answers to your questions (and sorry for the collective reply):

- I have several genome sequencing projects, each with millions of reads
(which means millions of files per project). Usually the reads are
concatenated into a single large file, but some annoying programs want them
as individual files. I store the raw data in a single folder under
/home/lab/bk/name_of_the_project, and I run the analyses somewhere else on
the same disk (actually /home/lab/nbk/name_of_the_project, which is not
backed up). Some programs modify the files, so I need to keep a "clean"
copy and can't just use links. So I guess that, on the admin side of the
world, this would be somewhat similar to someone who wants to duplicate
millions of email files from one place to another, but who also wants to do
it several times per week!

- Being a biologist, I did not know about all these hardware issues. Now I
realize that my hardware is really not well optimized. It's basically 2 SATA
drives (one for /, the other for /home), no RAID, with the default ext3
options that come with Ubuntu (block size = 4096), 4 CPUs and 20 GB of RAM.
Here are some more details:

[tristan@babylon ~] lspci
00:00.0 Host bridge: Intel Corporation 5000X Chipset Memory Controller Hub (rev 12)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev 12)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 12)
00:04.0 PCI bridge: Intel Corporation 5000X Chipset PCI Express x16 Port 4-7 (rev 12)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev 12)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 6 (rev 12)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev 12)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset Error Reporting Registers (rev 12)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset Error Reporting Registers (rev 12)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset Error Reporting Registers (rev 12)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
00:1b.0 Audio device: Intel Corporation 631xESB/632xESB High Definition Audio Controller (rev 09)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.3 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage Controller AHCI (rev 09)
00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
02:01.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E2 (rev 01)
05:0b.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
07:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
0b:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5752 Gigabit Ethernet PCI Express (rev 02)
0c:0a.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)

[tristan@babylon ~] sudo dumpe2fs /dev/mapper/sdb1 | more
dumpe2fs 1.40-WIP (14-Nov-2006)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          3289b20c-fb62-4158-8b76-edebc9681e16
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal filetype needs_recovery sparse_super large_file
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              56360960
Block count:              112721920
Reserved block count:     5636096
Free blocks:              28883202
Free inodes:              55118839
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Last mount time:          Wed Apr  9 11:31:37 2008
Last write time:          Wed Apr  9 11:31:37 2008
Mount count:              8
Maximum mount count:      30
Last checked:             Fri Jan 18 10:50:10 2008
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Journal backup:           inode blocks
Journal size:             128M
[...]

- I guess that my next step will be to get another hard drive and use
ReiserFS or ext2 with a small block size... I apparently have a SCSI
controller in this box, but also some empty slots for SATA disks. Would you
recommend a SCSI drive?
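
To put rough numbers on the small-block-size idea, here is a minimal sketch
in Python that uses the values from the dumpe2fs output above. The figure of
1,000,000 files per project is a hypothetical round number (the real count
is only known as "millions"), and the model assumes every non-empty file
occupies at least one whole block, which should hold here because the
fragment size equals the block size. It estimates space only and says
nothing about copy speed.

```python
import math

# Values copied from the dumpe2fs output above.
BLOCK_SIZE = 4096
INODE_COUNT = 56_360_960
FREE_INODES = 55_118_839

# Assumptions for illustration only (not measured):
FILE_SIZE = 400                # bytes per read file ("~400 bytes" in the original question)
FILES_PER_PROJECT = 1_000_000  # hypothetical round number


def blocks_needed(file_size, block_size):
    """Whole blocks needed to hold file_size bytes of data."""
    return math.ceil(file_size / block_size)


def allocated_bytes(n_files, file_size, block_size):
    """Data space consumed when every file is rounded up to whole blocks."""
    return n_files * blocks_needed(file_size, block_size) * block_size


used_inodes = INODE_COUNT - FREE_INODES
payload = FILES_PER_PROJECT * FILE_SIZE

print(f"inodes in use on the dumped filesystem: {used_inodes:,}")
for block_size in (1024, 2048, BLOCK_SIZE):
    on_disk = allocated_bytes(FILES_PER_PROJECT, FILE_SIZE, block_size)
    print(f"block size {block_size}: {on_disk / 1e9:.2f} GB allocated "
          f"for {payload / 1e9:.2f} GB of data ({payload / on_disk:.0%} used)")
```

With the current 4096-byte blocks, each ~400-byte file fills less than 10%
of the block it occupies, so a smaller block size would mainly save space.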

-Tristan


On Monday 21 April 2008 09:27:24 James D. Marco wrote:
> Hi Tristan,
>         I would suggest you have the wrong hardware and, as Alex
> and Marcel alluded to, an improper configuration for today's larger
> disks.
>         Check the block size on your disks (if not SCSI). It should
> match what you are selecting as a block size for Linux. It makes
> little to no difference to select smaller block sizes; they are all
> processed about the same. The time spent loading up the CPU (performing
> a context swap), then processing the data (usually a disk reference
> calculation), then unloading the CPU again is where you are
> burning speed; the actual calculation time (in integers) is small...
> 2 CPU cycles usually. Parallel CPUs can often run slower because
> the system has to decide which CPU to run the calculation on... 8 and
> 16 MB disk caches can go just so far in covering this up; then they slow
> down to raw disk speeds. Smaller block sizes only allow more files
> of smaller sizes in return for slightly slower performance.
>         Do not use the same disk for the transfers. Get another disk
> for this. A 64-bit SATA or SCSI card (RAID?) will slow access to one
> disk, but drastically improve your throughput... your basic problem.
> Such cards have a much larger cache (64 MB) and often a built-in
> coprocessor.
>         Also, copying from here to there on the same disk is ALWAYS
> hard on the disks. The CPU (or SCSI controller) keeps swapping
> back and forth to keep the disk going. It is better with SCSI, though:
> it does NOT tie up the CPU with disk reference calculations.
>         If I had to do this, I would get a second disk as my first
> upgrade. Second would be to format the disks as ext2, not journaled. Third
> would be to reset the block size to the disk block size. Fourth would be to
> increase memory (for caching).
>         Fifth would be to opt for a SCSI card (much more expensive!)
>         Sixth would be to apply RAID (just plain vanilla striping...)
>         Then get as many high-speed drives as possible working.
> I think you are flooding the cache and forcing the
> machine to fall back to raw disk speeds. Writing always takes more
> time than reading.
>         Thanks, and good luck!
>                 jdm
>
> At 06:37 PM 4/20/2008, Tristan Lefebure wrote:
> >Hi,
> >
> >These days I often have to copy several million files from one folder
> > to another on the same computer (and usually the same disk), and it takes
> > a while with a regular 'cp' approach (several hours).
> >
> >The files are rather small (~400 bytes), so I think that most of the time
> > is spent creating the files, not copying the data. Would you have a
> > suggestion to speed up the process?
> >
> >I've already tried to create a tar archive, but it also takes a while to
> > create and extract the archive. Should I use another file system? (I use
> > ext3 with Ubuntu 7.10.)
> >
> >Thanks for any help!
> >--
> >Tristan Lefebure
> >
> >Population Medicine & Diagnostic Sciences
> >College of Veterinary Medicine
> >Cornell University
> >
> >phone: (607) 253 4228
> >
> >http://www.people.cornell.edu/pages/tnl7/
>
> James Marco
> Computer Operations Manager
> Biomedical Engineering & Chemical and Biomolecular Engineering
> Cornell University
> B77 Olin Hall,
> Ithaca,  NY  14853
> Office: 255-7312, Computer Room: 255-0480
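
On the tar attempt mentioned in the original question quoted above: one
variation is to stream the archive straight from the source folder into the
destination, so that no intermediate archive file is ever written to disk.
Below is a minimal sketch of that idea in Python, using only the standard
library. The function name copy_tree_via_tar_stream is made up for this
example, and whether it is actually faster than 'cp' on this hardware is
untested. The same idea is usually written as a shell pipeline,
"tar -C SRC -cf - . | tar -C DST -xf -".

```python
import os
import tarfile
import threading


def copy_tree_via_tar_stream(src, dst):
    """Copy the contents of src into dst through an in-memory tar stream."""
    os.makedirs(dst, exist_ok=True)
    read_fd, write_fd = os.pipe()
    errors = []

    def produce():
        # Writer side: pack every top-level entry of src into the pipe.
        try:
            with os.fdopen(write_fd, "wb") as sink, \
                    tarfile.open(fileobj=sink, mode="w|") as archive:
                for name in sorted(os.listdir(src)):
                    archive.add(os.path.join(src, name), arcname=name)
        except Exception as exc:  # reported to the caller after join()
            errors.append(exc)

    writer = threading.Thread(target=produce)
    writer.start()

    # The archive is produced locally, so no extraction filtering is wanted;
    # the "filter" argument only exists on newer Python versions.
    extra = {"filter": "fully_trusted"} if hasattr(tarfile, "fully_trusted_filter") else {}

    # Reader side: unpack members as they arrive.
    with os.fdopen(read_fd, "rb") as source:
        with tarfile.open(fileobj=source, mode="r|") as archive:
            archive.extractall(dst, **extra)
        # Drain the trailing padding so the writer never hits a closed pipe.
        while source.read(65536):
            pass

    writer.join()
    if errors:
        raise errors[0]
```

A call would look like copy_tree_via_tar_stream("/path/to/clean_copy",
"/path/to/working_copy"), where both paths are placeholders. Note that this
still creates one file per read at the destination, so it does not remove
the per-file creation cost discussed in this thread.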


