Re: copying many files



Tristan,
        If you had unlimited funds, go with the middle-ground SCSI/SATA
option for your applications. As Oracle found out many years ago:
                Fast File Systems ARE Fast Databases. 
        One amendment to the previous note: if you are doing a LOT
of lookups/queries, a mirrored configuration is a bit faster.
        SAS (SCSI, sort of) or SATA shows throughput of ~3 Gbit/s. These
are expensive (card and 8 drives). Generally, these are optimized for
servers and are wasted with anything less than two 1 Gbit network cards.
Sounds like something you could use, but these have a slightly longer
latency in exchange for excellent throughput.
        SATA at 1.5 Gbit/s, fully populated with 6 drives in a striped
configuration. A second best to the above...
        Older Ultra 320s and SATA 150s in a striped configuration.
You may need a controller card, but it may well be built in..."most'est for
the least'est." These will really pump data over a single connection, and
are still fairly snappy when used directly.
        Have you looked into solid state drives? I think Dell is offering
these, but I have no knowledge of reliability, cost, or performance. I am still
checking these, but they appear a bit slow in today's configurations.
        Good Luck!
                jdm
At 10:27 AM 4/21/2008, Tristan Lefebure wrote:
>Thanks all! 
>
>Here are some answers to your questions (and sorry for the collective answer):
>
>- I have several genome sequencing projects, with millions of reads (which makes 
>millions of files per project). Usually the reads are concatenated into a 
>single large file, but some annoying programs want them as individual files. 
>I store the raw data in a single folder on /home/lab/bk/name_of_the_project, 
>and I run some analysis somewhere else on the same disk 
>(actually /home/lab/nbk/name_of_the_project, which is not backed-up). Some 
>programs will modify the files, so I need to keep a "clean" copy, and I can't 
>just use links. So I guess that on the admin side of the world this would be 
>somewhat similar to someone who wants to duplicate millions of email files 
>from one place to another, but who also wants to do it several times per 
>week!
>
>- Being a biologist, I did not know about all these hardware issues. Now I realize 
>that my hardware is really not well optimized. It's basically 2 SATA drives 
>(one for /, the other for /home), no RAID, with the default ext3 options that 
>come with Ubuntu (block size = 4096), 4 CPUs and 20GB of RAM. Here is some 
>more:
>
>[tristan@babylon ~] lspci
>00:00.0 Host bridge: Intel Corporation 5000X Chipset Memory Controller Hub (rev 12)
>00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 2 (rev 12)
>00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 12)
>00:04.0 PCI bridge: Intel Corporation 5000X Chipset PCI Express x16 Port 4-7 (rev 12)
>00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev 12)
>00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 6 (rev 12)
>00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev 12)
>00:10.0 Host bridge: Intel Corporation 5000 Series Chipset Error Reporting Registers (rev 12)
>00:10.1 Host bridge: Intel Corporation 5000 Series Chipset Error Reporting Registers (rev 12)
>00:10.2 Host bridge: Intel Corporation 5000 Series Chipset Error Reporting Registers (rev 12)
>00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
>00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 12)
>00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
>00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 12)
>00:1b.0 Audio device: Intel Corporation 631xESB/632xESB High Definition Audio Controller (rev 09)
>00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
>00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
>00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
>00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
>00:1d.3 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #4 (rev 09)
>00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
>00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
>00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
>00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE Controller (rev 09)
>00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA Storage Controller AHCI (rev 09)
>00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus Controller (rev 09)
>01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
>01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
>02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
>02:01.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E2 (rev 01)
>05:0b.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)
>07:00.0 VGA compatible controller: nVidia Corporation NV44 [Quadro NVS 285] (rev a1)
>0b:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5752 Gigabit Ethernet PCI Express (rev 02)
>0c:0a.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
>
>[tristan@babylon ~] sudo dumpe2fs /dev/mapper/sdb1 | more
>dumpe2fs 1.40-WIP (14-Nov-2006)
>Filesystem volume name:   <none>
>Last mounted on:          <not available>
>Filesystem UUID:          3289b20c-fb62-4158-8b76-edebc9681e16
>Filesystem magic number:  0xEF53
>Filesystem revision #:    1 (dynamic)
>Filesystem features:      has_journal filetype needs_recovery sparse_super large_file
>Default mount options:    (none)
>Filesystem state:         clean
>Errors behavior:          Continue
>Filesystem OS type:       Linux
>Inode count:              56360960
>Block count:              112721920
>Reserved block count:     5636096
>Free blocks:              28883202
>Free inodes:              55118839
>First block:              0
>Block size:               4096
>Fragment size:            4096
>Blocks per group:         32768
>Fragments per group:      32768
>Inodes per group:         16384
>Inode blocks per group:   512
>Last mount time:          Wed Apr  9 11:31:37 2008
>Last write time:          Wed Apr  9 11:31:37 2008
>Mount count:              8
>Maximum mount count:      30
>Last checked:             Fri Jan 18 10:50:10 2008
>Check interval:           0 (<none>)
>Reserved blocks uid:      0 (user root)
>Reserved blocks gid:      0 (group root)
>First inode:              11
>Inode size:               128
>Journal inode:            8
>Journal backup:           inode blocks
>Journal size:             128M
>[...]
>
>- I guess that my next step will be to get another hard drive, use Reiserfs or 
>EXT2 with a small block size... I apparently have a SCSI controller on this 
>box, but also have some empty slots for SATA disks. Would you recommend a 
>SCSI drive?
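
A rough sketch of what the block size means for ~400-byte files, using
the figures from the dumpe2fs output above. It ignores directory blocks,
reserved blocks, and the journal, and the 1024-byte block size is only a
hypothetical example of a "small block size". The function names are
invented for illustration.

```python
import math

BLOCK_SIZE = 4096        # "Block size" from the dumpe2fs output
FILE_SIZE = 400          # approximate size of one read file, in bytes
FREE_INODES = 55118839   # "Free inodes" from the dumpe2fs output
FREE_BLOCKS = 28883202   # "Free blocks" from the dumpe2fs output


def blocks_needed(file_size, block_size):
    """Data blocks used by one file (at least one block)."""
    return max(1, math.ceil(file_size / block_size))


def slack_fraction(file_size, block_size):
    """Fraction of the allocated space that holds no data."""
    allocated = blocks_needed(file_size, block_size) * block_size
    return 1 - file_size / allocated


def max_files(free_inodes, free_blocks, file_size, block_size):
    """Files that fit before either inodes or data blocks run out."""
    return min(free_inodes, free_blocks // blocks_needed(file_size, block_size))


if __name__ == "__main__":
    for size in (BLOCK_SIZE, 1024):
        print(f"block size {size}: {slack_fraction(FILE_SIZE, size):.0%} slack per file")
    print("files that fit today:",
          max_files(FREE_INODES, FREE_BLOCKS, FILE_SIZE, BLOCK_SIZE))
```

With these numbers each 400-byte file leaves about 90% of its 4096-byte
block unused, and free blocks, not free inodes, are the first limit.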
>
>-Tristan
>
>
>On Monday 21 April 2008 09:27:24 James D. Marco wrote:
>> Hi Tristan,
>>         I would suggest you have the wrong hardware, and as Alex
>> and Marcel alluded to, an improper configuration for today's larger
>> disks.
>>         Check the block size on your disks (if not SCSI). It should
>> match what you are selecting as a block size for Linux. It makes
>> little to no difference to select smaller block sizes; they are all
>> processed about the same. The time spent loading up the CPU (performing
>> a context swap), then processing the data (usually a disk reference
>> calculation), then unloading the CPU again is where you are
>> burning speed; the actual calculation time (in integers) is small...
>> 2 CPU cycles usually. Parallel CPUs can often run slower because
>> the system has to decide which CPU to run the calculation on... 8 and 16 MB
>> disk caches can go just so far in covering this up, then they slow
>> down to raw disk speeds. Smaller block sizes only allow more files
>> of smaller sizes in return for slightly slower performance.
>>         Do not use the same disk for the transfers. Get another disk
>> for this. A 64-bit SATA or SCSI card (RAID?) will slow access to one
>> disk, but drastically improve your throughput...your basic problem.
>> These cards have a much larger cache (64 MB) and often a built-in coprocessor.
>>         Also, copying from here to there on the same disk is ALWAYS
>> hard on the disks. The CPU (or SCSI controller) keeps swapping
>> back and forth to keep the disk going. It is better with SCSI, though:
>> it does NOT tie up the CPU with disk reference calculations.
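
A quick way to check whether a copy would stay on one filesystem is to
compare device IDs. This is only a sketch: `same_device` is a made-up
helper, and `st_dev` identifies a filesystem, not a physical disk, so two
partitions on one drive would still report different values.

```python
import os
import sys


def same_device(path_a, path_b):
    """True if both existing paths are on the same filesystem (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Pass the source and destination folders as arguments; both must exist.
    if len(sys.argv) == 3:
        src, dst = sys.argv[1], sys.argv[2]
        if same_device(src, dst):
            print("Same filesystem: reads and writes compete for one disk.")
        else:
            print("Different filesystems.")
```

Run it with the source and destination folders as the two arguments.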
>>         If I had to do this, I would get a second disk, as my first
>> upgrade. Second would be to format the disks as EXT2, not journaled. Third
>> would be to reset the block size to the disk block size. Fourth would be to
>> increase memory (as caching.)
>>         Fifth would be to opt for a SCSI card (much more expensive!)
>>         Sixth would be to apply RAID (just plain vanilla striping...)
>>         Then get as many high-speed drives as possible working.
>> I think you are running into flooding the cache and forcing the
>> machine to fall back to raw disk speeds. Writing always takes more
>> time than reading.
>>         Thanks, and good luck!
>>                 jdm
>>
>> At 06:37 PM 4/20/2008, Tristan Lefebure wrote:
>> >Hi,
>> >
>> >These days I often have to copy several million files from one folder
>> > to another on the same computer (and usually the same disk), and it takes
>> > a while with a regular 'cp' approach (several hours).
>> >
>> >The files are rather small (~400 bytes), so I think that most of the time
>> > is spent creating the files, not copying the data. Would you have a
>> > suggestion to speed up the process?
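
One way to test that guess is to time the creation of empty files against
files holding ~400 bytes. A minimal sketch follows; the function name,
file names, and sample size are all made up. Because the writes land in
the page cache rather than being forced to disk, the timings are only
indicative.

```python
import os
import tempfile
import time


def time_file_creation(n_files, payload):
    """Create n_files files holding payload in a temp dir; return seconds taken."""
    with tempfile.TemporaryDirectory() as tmp:
        start = time.perf_counter()
        for i in range(n_files):
            with open(os.path.join(tmp, f"read_{i}.txt"), "wb") as handle:
                handle.write(payload)
        # Elapsed time is taken before the temp dir is cleaned up.
        return time.perf_counter() - start


if __name__ == "__main__":
    n = 10000  # hypothetical sample; far fewer than the millions in question
    empty = time_file_creation(n, b"")
    small = time_file_creation(n, b"A" * 400)
    print(f"{n} empty files:    {empty:.2f} s")
    print(f"{n} 400-byte files: {small:.2f} s")
```

If the two timings come out close, file creation rather than the data
itself accounts for most of the cost.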
>> >
>> >I've already tried to create a tar archive, but it also takes a while
>> > to create and extract the archive. Should I use another file system (I use
>> > ext3 with Ubuntu 7.10)?
>> >
>> >Thanks for any help!
>> >--
>> >Tristan Lefebure
>> >
>> >Population Medicine & Diagnostic Sciences
>> >College of Veterinary Medicine
>> >Cornell University
>> >
>> >phone: (607) 253 4228
>> >
>> >http://www.people.cornell.edu/pages/tnl7/
>>
>> James Marco
>> Computer Operations Manager
>> Biomedical Engineering & Chemical and Biomolecular Engineering
>> Cornell University
>> B77 Olin Hall,
>> Ithaca,  NY  14853
>> Office: 255-7312, Computer Room: 255-0480
>
>
>
>-- 
>Tristan Lefebure
>
>Population Medicine & Diagnostic Sciences
>College of Veterinary Medicine
>Cornell University
>
>phone: (607) 253 4228
>
>http://www.people.cornell.edu/pages/tnl7/

James Marco
Computer Operations Manager
Biomedical Engineering & Chemical and Biomolecular Engineering
Cornell University
B77 Olin Hall,
Ithaca,  NY  14853
Office: 255-7312, Computer Room: 255-0480