Wednesday, April 3, 2019
Solution of a System of Linear Equations for INTELx64
A multi core hyper-threaded solution of a system of linear equations for INTELx64 architecture
Richa Singhal

ABSTRACT. A system of linear equations forms a fundamental part of linear algebra, with widespread applications in fields such as physics, chemistry and even electronics. With systems growing in complexity and an ever increasing demand for precise results, there is a pressing need for methods that can solve a large system of such equations accurately and as fast as possible. At the same time, because frequency scaling is becoming a limiting factor in improving processor performance, modern architectures are adopting a multi core approach with features such as hyper threading to meet performance requirements. This paper targets solving a system of linear equations on a multi core INTELx64 architecture with hyper threading, using the standard LU decomposition methodology. It also presents a forward seek LU decomposition approach which gives better performance by effectively utilizing the L1 cache of each processor in the multi core architecture. The sample uses as input a 4000 × 4000 matrix of double precision floating point values representing the system.

1. INTRODUCTION
A system of linear equations is a collection of linear equations in the same set of variables. A system of linear equations forms a fundamental part of linear algebra, with widespread applications in fields such as physics, chemistry and even electronics. With systems growing in complexity and an ever increasing demand for precise results, there is a pressing need for methods that can solve a large system of such equations accurately and as fast as possible. On the other hand, frequency scaling is becoming a limiting factor in achieving performance improvement.
With increasing clock frequency, the power consumption goes up:

$P = C \times V^2 \times F$

where P is the power consumption, C is the capacitance switched per clock cycle, V is the voltage and F is the frequency.

It was because of this factor alone that INTEL had to cancel its Tejas and Jayhawk processors. A newer approach is to deploy multiple cores, which can process mutually exclusive tasks of a job in parallel to achieve the required performance improvement. Hyper threading is another method, which makes a single core appear as two by using some additional registers. This, however, requires that traditional algorithms which are sequential in nature be reworked and factorized so that they can efficiently use the processing power offered by these architectures.

This paper aims to provide an implementation of the standard LU decomposition method for solving a system of linear equations, adopting a forward seek methodology to efficiently solve a double precision system of linear equations in 4000 variables. The proposed solution addresses all aspects of problem solving, from the file I/O that reads the input system of equations, to actually solving the system, to generating the required output using multi core techniques. The solution assumes that the input problem has one and only one solution.

2. CHALLENGES
The primary challenge is to rework the sequential LU decomposition method so that it can be decomposed into a set of independent problems which can be solved independently as far as possible. The output of the LU decomposition is then passed through the standard techniques of forward and backward substitution, again using multi core techniques, to reach the final output.

Another challenge is cache management. A set of 4000 double precision floating point variables takes approximately 32 KB of memory, and there are 4000 such equations put together, so efficiently managing all the data in cache becomes a challenge.
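
The footprint estimate above follows from simple arithmetic, sketched below in C. The function names are illustrative only, and the code assumes an 8 byte `double`, as on x64.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for one equation, i.e. one matrix row of n doubles.
 * Assumes sizeof(double) == 8, which holds on x64. */
size_t row_bytes(size_t n)
{
    assert(n > 0);
    return n * sizeof(double);
}

/* Bytes needed for the full n x n coefficient matrix. */
size_t matrix_bytes(size_t n)
{
    return n * row_bytes(n);
}
```

For n = 4000 this gives 32,000 bytes per row and 128,000,000 bytes (about 122 MiB) for the whole matrix. If one assumes a 32 KiB L1 data cache (the source does not state the cache size of its machine), a single row nearly fills it, and the matrix as a whole cannot stay cache resident.
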
A forward seek methodology was used in LU decomposition, which tries to bring the relevant data into L1 cache before it needs to be processed. It also tries to maximise the operations performed on a set of data once it is in cache, so that cache misses are kept to a minimum.

3. IMPACT
On a 40 core INTELx64 machine with hyper threading, the proposed method achieved a 72X speed-up over a standard sequential implementation.

4. STATE OF THE ART
The proposed solution uses state of the art programming techniques available for multithreaded architectures. It also uses the INTEL Advanced Vector Extensions (AVX) intrinsic instruction set, so that each thread processes several values per instruction. Native POSIX threads were used for threading. Efficient disk I/O was made possible by mapping the input vector file directly to RAM using mmap.

5. PROPOSED SOLUTION
A system of linear equations representing the CURRENT / VOLTAGE relationship for a set of resistances is defined as

RI = V

The steps to solve it are:

Decompose R into L and U
Solve LZ = V for Z
Solve UI = Z for I

The resistance matrix is modelled as a 4000 × 4000 array of double precision floating point elements. The memory address is 16 byte aligned (the declaration requests a 0x1000 byte boundary, which is a multiple of 16) so that RAM access speeds up for read and write operations.

    FLOAT RES[MATRIX_SIZE * MATRIX_SIZE] __attribute__((aligned(0x1000)));

The voltage matrix is modelled as a 4000 × 1 array of double precision floating point elements, aligned in the same way.

    FLOAT V[MATRIX_SIZE] __attribute__((aligned(0x1000)));

LU Decomposition
For the solution, the basic model of parallel LU decomposition described above was adopted. As we move along the diagonal of the main matrix, we calculate the factor values for the lower triangular matrix.
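
As a small worked illustration of the three steps listed above, consider a hypothetical 2 × 2 system. The numbers are chosen only for illustration and do not come from the paper. Moving along the diagonal, the single factor value is 2/4 = 0.5, and subtracting 0.5 times the first row from the second row gives the upper triangular matrix.

```latex
R = \begin{pmatrix} 4 & 2 \\ 2 & 3 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 1 & 0 \\ 0.5 & 1 \end{pmatrix}}_{L}
    \underbrace{\begin{pmatrix} 4 & 2 \\ 0 & 2 \end{pmatrix}}_{U},
\qquad
V = \begin{pmatrix} 10 \\ 8 \end{pmatrix}

LZ = V:\quad z_1 = 10,\quad z_2 = 8 - 0.5 \cdot 10 = 3

UI = Z:\quad i_2 = 3 / 2 = 1.5,\quad i_1 = (10 - 2 \cdot 1.5) / 4 = 1.75
```

Substituting back confirms the result: 4(1.75) + 2(1.5) = 10 and 2(1.75) + 3(1.5) = 8.
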
Simultaneously, each row operation updates the elements of the upper triangular matrix.

Basic routine to do row operation
This is the innermost routine; it updates the rows that will eventually form the upper triangular matrix. For each element of a row there is one subtraction and one multiplication operation. LOOP B designates the row major operation, while LOOP A designates the column major operation.

Basic Algorithm

    SUB LUDECOM (A, N)
      DO K = 1, N - 1                  // LOOP A: column major operation
        DO I = K + 1, N                // LOOP B: row major operation
          A(I,K) = A(I,K) / A(K,K)
          DO J = K + 1, N
            A(I,J) = A(I,J) - A(I,K) * A(K,J)
          END DO
        END DO
      END DO
    END LUDECOM

Each row major operation (LOOP B) can be executed independently on a separate core. This was achieved using POSIX threads which were non-blocking in nature. Because the threads work on mutually exclusive sets of data, MUTEX locks are not required, provided the column major operation (LOOP A) is kept sequential.

Also, for 2 consecutive elements in one row operation, 2 subtraction and 2 multiplication operations are done. Each pair of operations is done in a single step using Single Instruction Multiple Data (SIMD) instructions.

Multi core Algorithm

    SUB LUDECOM_BLOCK (A, K, BLOCK_START, BLOCK_END)
      DO I = BLOCK_START, BLOCK_END
        A(I,K) = A(I,K) / A(K,K)
        DO J = K + 1, N
          A(I,J) = A(I,J) - A(I,K) * A(K,J)
        END DO
      END DO
    END LUDECOM_BLOCK

    SUB LUDECOM (A, N)
      DO K = 1, N - 1
        BLOCK_SIZE = (N - K) / MAX_THREADS
        Thread = 0
        WHILE (Thread < MAX_THREADS)
          // rows K + 1 .. N are split into consecutive blocks, one per thread;
          // the last thread also takes any remaining rows
          P_THREAD (LUDECOM_BLOCK (A, K, K + 1 + Thread * BLOCK_SIZE, K + (Thread + 1) * BLOCK_SIZE))
          Thread = Thread + 1
        ENDWHILE
        // wait for all threads before moving to the next K
      END DO
    END LUDECOM

Forward substitution
Once LU decomposition is done, forward substitution gives the matrix Z by solving

LZ = V for Z

Here again Single Instruction Multiple Data instructions are used.

Backward substitution
After forward substitution, the final step of backward substitution gives the current matrix I by solving

UI = Z for I

Here again Single Instruction Multiple Data instructions are used.

6. CACHE IMPROVEMENTS
Profiling shows that the core of the processing in the above solution is the LU decomposition.
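
Since the decomposition is where the time goes, it helps to make its threading concrete. The multi core algorithm given earlier can be sketched with POSIX threads roughly as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the names `lu_block`, `lu_block_run` and `lu_decompose_threads` are hypothetical, `MAX_THREADS` is set to 4 arbitrarily, and pivoting is omitted as in the pseudocode, so a zero on the diagonal is not handled.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 4   /* assumed thread count, not taken from the paper */

/* Work for one thread: eliminate column k from rows [start, end). */
typedef struct {
    double *a;          /* n x n matrix, row major, factorized in place */
    size_t n, k;
    size_t start, end;
} lu_block;

/* LOOP B for one block of rows: compute the factor, then update the row. */
void *lu_block_run(void *arg)
{
    lu_block *b = arg;
    const double *pivot = b->a + b->k * b->n;
    for (size_t i = b->start; i < b->end; i++) {
        double *row = b->a + i * b->n;
        row[b->k] /= pivot[b->k];                 /* factor value for L */
        for (size_t j = b->k + 1; j < b->n; j++)
            row[j] -= row[b->k] * pivot[j];       /* update for U */
    }
    return NULL;
}

/* In-place LU decomposition without pivoting; returns 0 on success. */
int lu_decompose_threads(double *a, size_t n)
{
    assert(a != NULL && n > 0);
    for (size_t k = 0; k + 1 < n; k++) {          /* LOOP A stays sequential */
        pthread_t tid[MAX_THREADS];
        lu_block blk[MAX_THREADS];
        size_t rows = n - k - 1;
        size_t chunk = (rows + MAX_THREADS - 1) / MAX_THREADS;
        size_t started = 0;
        int rc = 0;

        for (size_t t = 0; t < MAX_THREADS; t++) {
            size_t lo = k + 1 + t * chunk;
            if (lo >= n)
                break;                            /* no rows left to hand out */
            size_t hi = (lo + chunk < n) ? lo + chunk : n;
            blk[t] = (lu_block){ a, n, k, lo, hi };
            if (pthread_create(&tid[t], NULL, lu_block_run, &blk[t]) != 0) {
                rc = -1;
                break;
            }
            started++;
        }
        /* Join before the next k: later columns depend on this update. */
        for (size_t t = 0; t < started; t++)
            pthread_join(tid[t], NULL);
        if (rc != 0)
            return rc;
    }
    return 0;
}
```

The row blocks are disjoint and the pivot row is only read, which is why no mutex appears. Note that this sketch creates and joins a fresh set of threads for every value of k, so each row is revisited by a different thread on every iteration.
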
Even when the number of threads created equals the number of available cores, the result improved, but not in proportion to the number of cores. A VALGRIND analysis of cache performance reveals that, because of the large size of the matrix, each row operation was suffering a performance hit due to cache misses.

In the above solution, the jth row is processed for (j - 1) columns, so (j - 1) threads are forked for it over the iterations of the column major operation (LOOP A). The data to be processed is at the same memory location each time, but by the time the next operation or thread is forked for the same row, the corresponding data has been pushed out of the lower level caches. Thus a cache miss happens.

To solve this we adopted a forward seek approach, in which we first pre-process a set of columns sequentially, thus enabling more operations on a row to be performed in the same thread. The data is then still in the lower level cache, as we do not have to wait for another thread to process the same row.

Multi core Algorithm with forward seek operation

    SUB LUDECOM_BLOCK_SEEK (A, K, S, BLOCK_START, BLOCK_END)
      DO I = BLOCK_START, BLOCK_END
        DO U = 1, S
          M = K + U - 1
          A(I,M) = A(I,M) / A(M,M)
          DO J = M + 1, N
            A(I,J) = A(I,J) - A(I,M) * A(M,J)
          END DO
        END DO
      END DO
    END LUDECOM_BLOCK_SEEK

    SUB LUDECOM (A, N)
      K = 1
      WHILE (K < N)
        // Forward seek: factorize the next F_SEEK pivot rows sequentially
        // (F_SEEK is reduced near the matrix edge so that K + F_SEEK - 1 <= N)
        DO J = K, K + F_SEEK - 2
          LUDECOM_BLOCK_SEEK (A, J, 1, J + 1, K + F_SEEK - 1)
        END DO
        // Multi core: each thread applies all F_SEEK pivots to its block of rows
        FIRST = K + F_SEEK
        BLOCK_SIZE = (N - FIRST + 1) / MAX_THREADS
        Thread = 0
        WHILE (Thread < MAX_THREADS)
          P_THREAD (LUDECOM_BLOCK_SEEK (A, K, F_SEEK, FIRST + Thread * BLOCK_SIZE, FIRST + (Thread + 1) * BLOCK_SIZE - 1))
          Thread = Thread + 1
        ENDWHILE
        K = K + F_SEEK
      END WHILE
    END LUDECOM

CONCLUSION
Results
For the purpose of computation, a sample double precision floating point matrix of size 4000 × 4000 was taken.
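
For readers who want to experiment at a smaller scale, the forward seek scheme described above can be sketched in C as follows. This is a minimal sequential sketch, not the author's implementation: the names `eliminate_rows` and `lu_forward_seek` and the parameter `seek` are hypothetical, pivoting is omitted as in the pseudocode, and the final call in each panel marks where blocks of rows would be handed to threads.

```c
#include <assert.h>
#include <stddef.h>

/* Apply pivots [k, k + s) to rows [start, end) of the n x n row-major
 * matrix a. Each row receives all s eliminations back to back, so it is
 * brought into cache once per panel instead of once per pivot. */
void eliminate_rows(double *a, size_t n, size_t k, size_t s,
                    size_t start, size_t end)
{
    for (size_t i = start; i < end; i++) {
        double *row = a + i * n;
        for (size_t m = k; m < k + s; m++) {
            const double *pivot = a + m * n;
            row[m] /= pivot[m];                    /* factor value for L */
            for (size_t j = m + 1; j < n; j++)
                row[j] -= row[m] * pivot[j];       /* update for U */
        }
    }
}

/* In-place LU decomposition without pivoting, `seek` pivots per panel. */
void lu_forward_seek(double *a, size_t n, size_t seek)
{
    assert(a != NULL && n > 0 && seek > 0);
    for (size_t k = 0; k < n; k += seek) {
        size_t s = (k + seek < n) ? seek : n - k;  /* clip the last panel */

        /* Forward seek: factorize the s pivot rows sequentially, one
         * pivot at a time, touching only the rows inside the panel. */
        for (size_t m = k; m + 1 < k + s; m++)
            eliminate_rows(a, n, m, 1, m + 1, k + s);

        /* The remaining rows are independent of each other; in a threaded
         * version each thread would take a block of [k + s, n). */
        eliminate_rows(a, n, k, s, k + s, n);
    }
}
```

With `seek` set to 1 this reduces to the basic algorithm. Larger values lengthen the sequential phase but do more work on each row per visit. The value that suits a given L1 cache is not stated in the source and would need to be found by measurement.
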
Performance numbers were generated on an 8 core INTEL architecture machine.

TABLE 4.i

A programmer who writes implicitly parallel code does not need to worry about task division or process communication, focusing instead on the problem that his or her program is intended to solve. Implicit parallelism generally facilitates the design of parallel programs and therefore results in a substantial improvement of programmer productivity.

Many of the constructs necessary to support this also add simplicity or clarity even in the absence of actual parallelism. The example above, of List comprehension in the sin() function, is a useful feature in and of itself. By using implicit parallelism, languages effectively have to provide such useful constructs to users simply to support required functionality (a language without a decent for loop, for example, is one few programmers will use).

Languages with implicit parallelism reduce the control that the programmer has over the parallel execution of the program, resulting sometimes in a less-than-optimal solution. The makers of the Oz programming language also note that their early experiments with implicit parallelism showed that implicit parallelism made debugging difficult and object models unnecessarily awkward.

A larger issue is that every program has some parallel and some nonparallel logic. Binary I/O, for example, requires support for such serial operations as Write() and Seek(). If implicit parallelism is desired, this creates a new requirement for constructs and keywords to support code that cannot be threaded or distributed.

REFERENCES
Gottlieb, Allan; Almasi, George S. (1989). Highly parallel computing. Redwood City, Calif.: Benjamin/Cummings. ISBN 0-8053-0177-1.
S.V. Adve et al. (November 2008). "Parallel Computing Research at Illinois: The UPCRC Agenda" (PDF). Parallel@Illinois, University of Illinois at Urbana-Champaign.
"The main techniques for these performance benefits (increased clock frequency and smarter but increasingly complex architectures) are now hitting the so-called power wall. The computer industry has accepted that future performance increases must largely come from increasing the number of processors (or cores) on a die, rather than making a single core go faster."
Asanovic et al. Old conventional wisdom: power is free, but transistors are expensive. New conventional wisdom is that power is expensive, but transistors are "free".
Bunch, James R.; Hopcroft, John (1974), "Triangular factorization and inversion by fast matrix multiplication", Mathematics of Computation 28: 231-236, doi:10.2307/2005828, ISSN 0025-5718.
Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001), Introduction to Algorithms, MIT Press and McGraw-Hill, ISBN 978-0-262-03293-3.
Golub, Gene H.; Van Loan, Charles F. (1996), Matrix Computations (3rd ed.), Baltimore: Johns Hopkins, ISBN 978-0-8018-5414-9.